Data Science Capstone - Milestone Report

1. Executive summary

The goal of the Capstone Project is to create a predictive text product, that implements a predictive text model and algorithm built from analyzing a large corpus of text documents, to provide an interface that can be accessed by others.

During the last few weeks the following tasks have been accomplished:

Data has been downloaded.
A preliminary Exploratory Analysis has been performed.
Tokenization has been carried out, using different R packages and regular expressions.
A set of resources about Natural Language Processing has been gathered and studied.
An in-depth Exploratory Analysis was carried out, using that previously learned and using some R packages, and conclusions have been derived.
First conclusions to set up a prediction algorithm has been outlined.
Based on those conclussions, a first prediction algorithm (simple back off) has been implemented and it’s being tested.

This report is organized as follows:

The Data
Exploring the Data
Resources about Natural Language Processing
Guidelines for implementing a prediction algorithm
Implementing a first simple back off prediction algorithm

The R code used can be found at github

2. The Data

2.1 Downloading

The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. It will be downloaded from Coursera (Capstone Dataset)

The files are named LOCALE.[blogs|news|twitter].txt where LOCALE is the each of the four locales en_US, de_DE, ru_RU and fi_FI.

Compressed data files were downloaded. We had to be careful with files codification, and chose to read them as “UTF-8”.

##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00

2.2 First Exploratory Analysis

Next, we need to get familiar with the data.

We are going to work with the “en_US” text files using a representative sample subset of the data, randomly selecting rows to be included in the samples.

To accomplish this we first made some first statistics about the size and composition of the files.

File	NLines	NChars
Blogs	899288	206824382
News	1010242	203223154
Twitter	2360148	162096241

Simply at first glance, we realize that (A) we’re going to work with very large text files; and (B) the diversity of the texts (punctuation, emoticons, hashtags misspelling, synonyms) is really huge.

We can think in samples with 10.000 lines (more or less, and just as a rule of thumb, the margin of error would be $1/\sqrt{10000} = 1\%$). 10.000 is 0.42 % of 2360148 (the number of lines in twitters file). Finally we used a sample size ten times larger.

We made the sampling using a customized function and obtained 3 sampled text files containing 5% random lines of each blogs, news and twitters data text files.

2.3 Tokenization and cleaning

As said before, reviewing the lines in the sample we realized we needed to perform “a bit”" of cleaning before tokenization to deal with punctuation, white spaces, upper/lower case characters, emoticons, non-printable characters an so on. This has been done with the a special-purpose function (extensively using regular expressions and gsub()), that is later employed by the tm package tools to build 1, 2 and 3-grams Term-Document Matrixes (TDM) from the sampled data files, as the starting point for our in-depth exploration. As tokenizer we used NGramTokenizer() from RWekapackage.

Numbers, hashtags, twitter users and e-mail adresses have been labeled.
Ten types of emoticons have been considered.
Non printable and most of puntuaction characters have been removed (except thoe joining words and those finishing phrases).
Whitespaces have been stripped.
Everything have been converted to lowercase.

3. Exploring the Data

3.1 Terms Frequencies

Now we have 1, 2 and 3-Grams TDMs. As an example, see a 2-Wordgrams TDM:

##                  Docs
## Terms             sample_blogs.txt sample_news.txt sample_twitter.txt
##   a haircut                      1               3                 15
##   a hairdo                       0               0                  1
##   a hairdresser                  0               1                  1
##   a hairpin                      1               0                  1
##   a hairspiration                1               0                  0
##   a hairstylist                  0               1                  0
##   a hairy                        0               1                  0
##   a haitian                      1               2                  0
##   a haka                         0               1                  0
##   a hal-lie                      0               1                  0
##   a half                        91              76                 46

Each TDM shows, by file, a Term Frequency List. Each entry in this table shows the count of each term (or type) instances, i.e., the count of its tokens.

Let’s begin our exploratory data analysis with some stats from the 1-grams TDMs. We’ve obtained the % of corpus coverage by trial and error. Below you can see an example for a 95% tokens coverage:

## 
## SAMPLE_BLOGS.TXT term counts statistics:
## 
## Centiles:
## 
##   50%   51%   52%   53%   54%   55%   56%   57%   58%   59%   60%   61% 
##     1     1     1     2     2     2     2     2     2     2     2     2 
##   62%   63%   64%   65%   66%   67%   68%   69%   70%   71%   72%   73% 
##     2     2     2     2     3     3     3     3     3     3     3     4 
##   74%   75%   76%   77%   78%   79%   80%   81%   82%   83%   84%   85% 
##     4     4     4     5     5     5     6     6     7     7     8     9 
##   86%   87%   88%   89%   90%   91%   92%   93%   94%   95%   96%   97% 
##    10    11    13    14    17    19    23    27    33    42    55    78 
##   98%   99%  100% 
##   124   250 91763 
## 
## All Terms (frequencies):
##   Number of Types:              73779
##   Number of Tokens:             1881143
##   Tokens / Types:               25.50
##   Frequency Mean:               25.50
##   Frequency Variance:           359013.40
##   Frequency Standard Deviation: 599.18
##   Frequency Standard Error:     2.21
## 
## Frequent Terms (count >= 6):
##   Number of Types:              15407
##   Number of Tokens:             1785871
##   Tokens / Types:               115.91
##   Frequency Mean:               115.91
##   Frequency Variance:           1708946.23
##   Frequency Standard Deviation: 1307.27
##   Frequency Standard Error:     10.53
## 
##   Frequent Terms Percentage:    20.88 %
##   Frequent Terms Coverage:      94.94 %
## 
## --------------------

## 
## SAMPLE_NEWS.TXT term counts statistics:
## 
## Centiles:
## 
##     50%     51%     52%     53%     54%     55%     56%     57%     58% 
##     2.0     2.0     2.0     2.0     2.0     2.0     2.0     2.0     2.0 
##     59%     60%     61%     62%     63%     64%     65%     66%     67% 
##     2.0     2.0     2.0     2.0     2.0     3.0     3.0     3.0     3.0 
##     68%     69%     70%     71%     72%     73%     74%     75%     76% 
##     3.0     3.0     3.0     4.0     4.0     4.0     4.0     5.0     5.0 
##     77%     78%     79%     80%     81%     82%     83%     84%     85% 
##     5.0     6.0     6.0     6.0     7.0     8.0     8.0     9.0    10.0 
##     86%     87%     88%     89%     90%     91%     92%     93%     94% 
##    11.0    12.0    14.0    16.0    18.0    21.0    25.0    30.0    37.0 
##     95%     96%     97%     98%     99%    100% 
##    46.0    61.0    86.0   133.0   258.8 97940.0 
## 
## All Terms (frequencies):
##   Number of Types:              74621
##   Number of Tokens:             1711106
##   Tokens / Types:               22.93
##   Frequency Mean:               22.93
##   Frequency Variance:           295802.42
##   Frequency Standard Deviation: 543.88
##   Frequency Standard Error:     1.99
## 
## Frequent Terms (count >= 5):
##   Number of Types:              18704
##   Number of Tokens:             1625034
##   Tokens / Types:               86.88
##   Frequency Mean:               86.88
##   Frequency Variance:           1174712.74
##   Frequency Standard Deviation: 1083.84
##   Frequency Standard Error:     7.92
## 
##   Frequent Terms Percentage:    25.07 %
##   Frequent Terms Coverage:      94.97 %
## 
## --------------------

## 
## SAMPLE_TWITTER.TXT term counts statistics:
## 
## Centiles:
## 
##   50%   51%   52%   53%   54%   55%   56%   57%   58%   59%   60%   61% 
##     1     1     1     1     1     1     1     1     1     1     1     1 
##   62%   63%   64%   65%   66%   67%   68%   69%   70%   71%   72%   73% 
##     1     1     1     1     2     2     2     2     2     2     2     2 
##   74%   75%   76%   77%   78%   79%   80%   81%   82%   83%   84%   85% 
##     2     2     3     3     3     3     3     4     4     5     5     5 
##   86%   87%   88%   89%   90%   91%   92%   93%   94%   95%   96%   97% 
##     6     7     8     9    10    12    14    17    22    28    37    54 
##   98%   99%  100% 
##    89   204 48239 
## 
## All Terms (frequencies):
##   Number of Types:              78110
##   Number of Tokens:             1561577
##   Tokens / Types:               19.99
##   Frequency Mean:               19.99
##   Frequency Variance:           181152.63
##   Frequency Standard Deviation: 425.62
##   Frequency Standard Error:     1.52
## 
## Frequent Terms (count >= 5):
##   Number of Types:              13285
##   Number of Tokens:             1474410
##   Tokens / Types:               110.98
##   Frequency Mean:               110.98
##   Frequency Variance:           1055185.37
##   Frequency Standard Deviation: 1027.22
##   Frequency Standard Error:     8.91
## 
##   Frequent Terms Percentage:    17.01 %
##   Frequent Terms Coverage:      94.42 %
## 
## --------------------

As you can see, this stats show a very skewed distribution of the types:

Most of terms are on the last 1% (less frequent)
With just a few tokens we can cover a lot of terms:

Coverage:	50%	90%	95%
Blogs	0.14%	9.0%	20.9%
News	0.26%	11.1%	25.0%
Twitter	0.16%	7.3%	17.0%
Total	0.09%	4.6%	11.8%

This means that, choosing as dictionary the set of Corpus Terms that appear at least 50 times (cut off, 4.6% of terms), we have a tokens coverage of 90%. Or choosing as dictionary the set of Corpora Terms that appears at least 12 times, we have a tokens coverage of 95%.

2G and 3G are different: e.g., if i choose even count cut off = 2 for 2G, coverage falls fromm 100% (count = 1) to 76.48%.

These are the the first 20 more frequent terms in US_en_sample Corpus:

## Words in dictionary: 7252

##      sample_blogs.txt sample_news.txt sample_twitter.txt  TOTAL
## the             91763           97940              45497 235200
## to              52797           44883              38969 136649
## and             54420           43967              21554 119941
## a               44596           43449              29975 118020
## of              43351           38301              17799  99451
## i               41195            7948              35823  84966
## in              29031           33289              18448  80768
## !                8185             496              48239  56920
## for             18308           17490              18971  54769
## unk             13366           25517              14743  53626
## is              21489           14299              17830  53618
## that            23431           17529              11614  52574
## it              21884           11344              14415  47643
## you             15729            4855              26924  47508
## on              13487           13266              13523  40276
## with            14364           12730               8564  35658
## was             13821           11131               5806  30758
## my              13771            2137              14342  30250
## at               8359           10747               9027  28133
## ?                6391            2410              19011  27812

Let’s use our 90% dictionary to make some plots by frequency rank:

The more frquent type is “the”: more than 4% of all the tokens are the term type “the”.

What we are seeing is the expression of Zipf’s Law: Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. […] Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

We can use tm function Zipf_plot() from tm package to explore the fit of this Law to our data:

## (Intercept)           x 
##   15.282878   -1.319597

As shown, terms frequency fits quite well for terms appearing more than 4 times.

3.2 Terms Spectrum

Beyond terms frequencies, we can now explore the frequency of frequencies. Some explanation is needed here:

Let V = {w1,w2,w3,w4,w5} be our VOCABULARY. |V| = Card(V) = 5 TERMS or TYPE TERMS.

Let T = (w1,w2,w1,w3,w3,w4,w3,w5,w1,w5) be our TEXT. N = 10 TOKENS.

The counts frequencies and counts densities are:

c(w1) = 3; %c(w1) = c(w1) / N_t = 3/10
c(w2) = 1; %c(w1) = c(w2) / N_t = 1/10
c(w3) = 3; %c(w1) = c(w3) / N_t = 3/10
c(w4) = 1; %c(w1) = c(w4) / N_t = 1/10
c(w5) = 2; %c(w1) = c(w5) / N_t = 2/10

For example, “c(w1) = 3; %; c(w1) = 3/10”" means that 30% of the corpus tokens are instanced by the term w1. This counts is what we’ve been talking about in the last section.

But now we’ll wonder how many times a term appears “c” times.

We can compute frequencies of frecuencies and densities of frequencies:

V_m=1 = 2; %V_m=1 = 2/5
V_m=2 = 1; %V_m=1 = 1/5
V_m=3 = 2; %V_m=1 = 2/5

For example, V_m=1 = 2; %; V_m=1 = 2/5 means that 40% of the vocabulary terms appear twice in the text.

This is also known as frequency spectrum.

Here we have a 1-grams spectrum plot using Heaps_plot() function (package tm):

## (Intercept)           x 
##   0.4885902   0.7411750

The red line is the observed spectrum while the black one is the spectrum as predicted by the Heap’s Law: Heaps’ law is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (so called type-token relation). It can be formulated as:

\[ V_R(n) = Kn^\beta \]

It simply means that the vocabulary length |V| grows with the document length in a particular manner.

Or, with zipfR package tools:

And, summing up:

Frequency spectrum shape (HEap’ Law) can be derived from Zipf Law. zipfR allows us to fit a finite Zipf-Mandelbrot model for the spectrum distribution (see “LNRE modelling”):

## finite Zipf-Mandelbrot LNRE model.
## Parameters:
##    Shape:          alpha = 0.6243447 
##    Lower cutoff:       A = 3.920857e-39 
##    Upper cutoff:       B = 0.004740673 
##  [ Normalization:      C = 2.804611 ]
## Population size: S = 4.279724e+24 
## Sampling method: Poisson, with exact calculations.
## 
## Parameters estimated from sample of size N = 1561577:
##                  V      V1      V2      V3      V4      V5    
##    Observed: 78110 50798.0 7979.00 3781.00 2267.00 1604.00 ...
##    Expected: 78110 48846.8 9174.78 4207.11 2498.66 1686.92 ...
## 
## Goodness-of-fit (multivariate chi-squared test):
##          X2 df            p
##    472.5227 13 9.945421e-93

(predictions are red bars, observation the black ones)

Although it appears to fit quite well, let’s be careful with the low p-value produced by the model.

Then, from the spectrum model we can derive a frequency Zipf-Mandelbrot model (see “LNRE modelling”):

The blue line is the observed terms frequency, the red one is the predicted one.

As announced by the low p-value, this model doesn’t works fine. It’s a pity because, with such a model at disposal, we’d probably be able to save a lot of RAM resources (no need to store frequencies, just store the ranked list of terms and calculate frequencies from the model).

3.3 RAM Estimations (5% sample)

From the sizes of the TDMs obtained some estimations of RAM resources needed have been done:

Twitter - 95% coverage

## RAM for 1-Wordgrams table: 2.718 Mbytes

## RAM for 2-Wordgrams table: 27.181 Mbytes

## RAM for 3-Wordgrams table: 54.363 Mbytes

## Total RAM needed: 84.262 Mbytes

3.4 Exploratory Analysis Conclusions

The text files are really large. We should intensively work with token samples.

To get a statistically significant sample:

We need to make N samples each with size n. Let’s say N = 100 samples and n = 1000 tokens to have margin errors in the order of $1/ = 3$“.

Let’s take the twitter file as example. From our Exploratory Analysis we know this file has 1,888,118 lines and we extracted a 5% sample (94,406 lines) were we found 1,561,577 tokens, i.e., there are 13.23 tokens per line in average.

So, to get a sample of 1,000 tokens i need 1,000 / 13.23 = 75.6 lines in average. As i want 100 samples, i need a sample of 7,560 lines.

Of course, if i’m sampling 2-grams i’ll need 15,120 lines; 22,680 lines for 3-grams and 30,240 for 4-grams.

However, when i extract 1,000 tokens from 75.6 lines each token extraction is NOT independent from the previous ones. So we must be cautious and take action in anticipating that something goes wrong. So in each sample we’ll extract double the number of the lines calculated for 4-grams by the procedure outlined above.

These are the resulting numbers (% of each file i need to sample for N-grams):

File	1-grams	2-grams	3-grams	4-grams
Blogs	0.40%	0.70%	1.10%	1.40%
News	0.40%	0.80%	1.20%	1.50%
Twitter	0.50%	0.90%	1.30%	1.70%

Terms frequency distribution allows to manage 95% of types with 21, 25, 17 and 12 % of terms (respectively for Blogs, News, Twitter and total corpus). We can use reduced dictionaries for efficiency sake.
- However, the not so good results of fitting terms frequency dstribution by a Zipfs- Mandelbrott model indicate table frequency lists can’t be substituted by probability calculations based on that model.
We must be careful with the last two conclusions, as vocabulary (number of types) grows with the length (number of tokens) of text. When working with samples we can choose reduced dictionaries to cover, for instance, 95% of tokens; but the same dictionaries will cover lower tokens % when working with larger samples.
RAM considerations: with 5% samples we need RAM allocation in the order of 50 - 100 Mbytes. Appropiated data structures and tools for fast search in tables must be considered. (e.g., data.table and ff packages)

4. Resources about Natural Language Processing

During the analysis, the following resources were studied:

CRAN Task View: Natural Language Processing: a comprehensive index of tools and resources in R to work with NPL
R Programming/Text Processing: This page includes all the material you need to deal with strings in R, specially regular expressions
Natural language processing: Coursera / Standford NLP course. Extremely important material about basic text processing in Week 1 and language modelling in Week 2 materials. A PDF chapter with the language modelling issues can be found at:
- Speech and Language Processing. Daniel Jurafsky & James H. Martin. Draft of September 1, 2014.
Packages
- tm: Introduction to the tm Package - Text Mining in R
- zipfR: zipfR: user-friendly LNRE modelling in R
- data.table: Introduction to the data.table package in R

5. Guidelines for implementing a prediction algorithm

From the above analysis and study, the following guidelines have been defined in a work plan (main reference: Speech and Language Processing.:

1- Divide corpus (en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt) into 3 disjoint and randomized sets:

Training set
Devset
Test set

2- Training

Add seudowords for 2 and 3-wordgrams at the extremes of sentences.
Deal with zeros (things that don’t occur in the training set but do occur in the testing set, i.e., unknown words or out of vocabulary (OOV) words.) Open vocabulary system strategy where we model these potential unknown words in the test set by adding a pseudo-word called $<UNK>$.
Smoothing: what do we do with words that are in our vocabulary (they are not unknown words) but appear in a test set in an unseen context (for example they appear after a word they never appeared after in training). To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen. This modification is called smoothing or discounting.
Backoff strategy

3- Predict

I see two posibilities, the first easier to implement than the second:

A - Simple back off algorithm

From N-wordgrams TDMs we can build N-wordgrams count tables. They will be very sparse (many of the cells will be zero, i.e., not seen N-wordrams). We can transform these N-wordgrams count tables to N-wordgrams probability tables.

(Perhaps it’s possible to use directly the TDMs instead of building so sparse N-wordgrams count tables)

The algorithm is as follows:

Take as input N-1 words
Use them to:
- Return the most frequent word (or sample one if 2 or more have the same frequency)
- Or sample 1 word from the N-1 bigram row in the N-wordgram probability table taking the probabilities in this row as weigths:

Next word = sample(colnames, size = 1, prob = probabilities in the row of N-1 bigram)

Back off: If the N-1 bigram is not found (or if its probability is very low ???) in the N-wordgram probability table, repeat the process with the N-2 wordgram probability table and with the N-2 last input words (and so on)

If we arrive to the unigram probability table and no result is found, return “Next word probably not in my training dictionary”.

If we can work just with the TDMs, as they are not too big, we could use N=4 or even N=5.

B - Smoothing + back off/Interpolation Kneser-Ney algorithm

See video and slides about Kneser-Ney Smoothing.

With this algorithm we end up with a smoothed N-wordgram probability table (no zeros). The probabilities of the original zeros is obtaining discounting from the original seen N-grams probabilities following an elegant backoff/interpolation strategy.

Difficulties can arise from the fact that this N-wordgram probability table could be really huge. A RAM estimation is needed to choose N.

In this case we just have to sample 1 word from the N-1 bigram row in the N-wordgram probability table taking the probabilities in this row as weigths:

Next word = sample(colnames, size = 1, prob = probabilities in the row of N-1 bigram)

Only in case no result is found we should return “Next word probably not in my training dictionary”.

4- Test and Evaluate model

Accuracy and perplexity measures of N-grams models trained on the training set will be taken on the test set.

This cycle will be developed completely with the easier prediction algorithm in order to build a shiny application as soon as possible. If time is left, the second prediction algorithm will be implemented and compared to the first one.

6. Implementing a first simple back off prediction algorithm

This is the current situation of the development of the algorithm:

Training and testing sample files have been created for each blogs, news and twitter data files. Training files = 80 % and testing files = 20 % of each data file.
Training. For each training file a sample has been created.
- Determining dictionaries Each sample has been tokenized, after processing punctuation, in order to obtain 1-grams TDMs. Stats has been calculated to determine how many counts are necessary to retain a token in a dictionary (criteria: 95% token coverage). Based on that, dictionaries have been created.
- After that, all tokens not in the dictionary have been reemplaced in the whole corpus by the label $<UNK>$.
- Next, new TDMs have been calculated, now for 1,2,3 and 4-grams. In the case of 2, 3 and 4-grams, pseudo-words have been added at the beggining and at the end oeach line.
- From these TDMs, Table Frequency Lists (TFLs) have been derived and stored as data.table’s. Now we have a model (4 TFLs each: 1, 2, 3 and 4-grams) for Blogs, News and Twitter documents and for the whole Corpus. Blogs, News and Twitter models use 40 Mbytes data tables and Corpus model, 70 Mbytes.
- A prediction function have been implemented, following a simple back off algorithm beginning with 4-grams.
Testing. We have built 4-gram TFLs (for Blogs, News, Twitter and whole Corpus) from the test set. The prediction function, with the appropiated model, is feeded by the first three tokens of each testing 4-gram and the prediction is compared with the fourth token. Accuracy will be measured carrying out 100 times 1000 predictions, to calculate an average and a confidence interval. The models and prediction function are been tested nowadays.
Evaluation. The four models (Blogs, News, Twitter and Corpus) accuracies will be evaluated against its respective testing sets in the two modes explained above (with and without final weighted sampling).
Further work After implementing this models in a Shiny application, a new model (Kneser-Ney algorithm) will be implemented, evaluated and compared with the actual models before adding it to the Shiny application.

Data Science Capstone - Milestone Report

Friday, March 27, 2015

1. Executive summary

Twitter - 95% coverage