The goal of the Capstone Project is to create a predictive text product: a predictive text model and algorithm built by analyzing a large corpus of text documents, exposed through an interface that can be accessed by others.
During the last few weeks the following tasks have been accomplished:
This report is organized as follows:
The R code used can be found on GitHub.
The data come from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The data set will be downloaded from Coursera (Capstone Dataset).
The files are named LOCALE.[blogs|news|twitter].txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI.
Compressed data files were downloaded. We had to be careful with the file encodings, and chose to read them as “UTF-8”.
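As a sketch, the download and read steps might look as follows (the URL is assumed to be the Coursera “Capstone Dataset” link; the archive listing shown below is what `unzip(list = TRUE)` returns):

```r
# Hedged sketch: the URL is assumed to be the Coursera "Capstone Dataset" link.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}
unzip(zip_file, list = TRUE)   # archive listing (shown below)
unzip(zip_file, exdir = ".")   # extract to ./final/...
# Read a file declaring the encoding explicitly
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
```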
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
Next, we need to get familiar with the data.
We are going to work with the “en_US” text files using a representative sample subset of the data, randomly selecting rows to be included in the samples.
To accomplish this, we first computed some basic statistics about the size and composition of the files.
| File | Number of lines | Number of characters |
|---|---|---|
| Blogs | 899288 | 206824382 |
| News | 1010242 | 203223154 |
| Twitter | 2360148 | 162096241 |
At first glance, we realize that (A) we are going to work with very large text files; and (B) the diversity of the texts (punctuation, emoticons, hashtags, misspellings, synonyms) is huge.
We can consider samples of about 10,000 lines (as a rule of thumb, the margin of error would be \(1/\sqrt{10000} = 1\%\)). 10,000 is 0.42% of 2,360,148 (the number of lines in the twitter file). In the end we used a sample size ten times larger.
We did the sampling with a custom function and obtained 3 sampled text files, each containing a random 5% of the lines of the blogs, news and twitter data files.
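A minimal sketch of such a sampling step (`sample_file()` is a hypothetical helper, not the exact function used in this report; it keeps each line with probability `p`):

```r
# Hypothetical helper: keep each line of `infile` with probability p.
sample_file <- function(infile, outfile, p = 0.05, seed = 1234) {
  set.seed(seed)                                   # reproducible samples
  con <- file(infile, open = "r", encoding = "UTF-8")
  lines <- readLines(con, skipNul = TRUE)
  close(con)
  keep <- rbinom(length(lines), size = 1, prob = p) == 1
  writeLines(lines[keep], outfile)
  invisible(sum(keep))                             # number of lines written
}

sample_file("final/en_US/en_US.blogs.txt",   "sample_blogs.txt")
sample_file("final/en_US/en_US.news.txt",    "sample_news.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample_twitter.txt")
```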
As said before, reviewing the lines in the sample we realized we needed to perform “a bit” of cleaning before tokenization, to deal with punctuation, white space, upper/lower case characters, emoticons, non-printable characters and so on. This has been done with a special-purpose function (making extensive use of regular expressions and gsub()), which is later employed together with the tm package tools to build 1-, 2- and 3-gram Term-Document Matrices (TDMs) from the sampled data files, as the starting point for our in-depth exploration. As tokenizer we used NGramTokenizer() from the RWeka package.
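A sketch of the TDM construction step (the cleaning function is only referenced, not reproduced; names like `make_tokenizer()` are illustrative):

```r
library(tm)
library(RWeka)

# Build a corpus from the three sample files created above
corpus <- VCorpus(DirSource(".", pattern = "^sample_.*\\.txt$"),
                  readerControl = list(language = "en"))

# corpus <- tm_map(corpus, content_transformer(clean_text))
# clean_text() stands for the special-purpose cleaning function described
# above (regular expressions + gsub()); its body is not reproduced here.

# n-gram tokenizers based on RWeka's NGramTokenizer()
make_tokenizer <- function(n) {
  function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
}

tdm1 <- TermDocumentMatrix(corpus, control = list(tokenize = make_tokenizer(1)))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = make_tokenizer(2)))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = make_tokenizer(3)))

inspect(tdm2[1001:1011, ])   # an arbitrary slice, like the 2-gram example below
```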
Now we have the 1-, 2- and 3-gram TDMs. As an example, here is a slice of the 2-gram TDM:
## Docs
## Terms sample_blogs.txt sample_news.txt sample_twitter.txt
## a haircut 1 3 15
## a hairdo 0 0 1
## a hairdresser 0 1 1
## a hairpin 1 0 1
## a hairspiration 1 0 0
## a hairstylist 0 1 0
## a hairy 0 1 0
## a haitian 1 2 0
## a haka 0 1 0
## a hal-lie 0 1 0
## a half 91 76 46
Each TDM gives, per file, a term frequency list: each entry shows the number of instances (tokens) of a given term (type).
Let’s begin our exploratory data analysis with some statistics from the 1-gram TDMs. We found, by trial and error, the count cutoff needed to reach a given percentage of corpus coverage. Below is an example for 95% token coverage:
##
## SAMPLE_BLOGS.TXT term counts statistics:
##
## Centiles:
##
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61%
## 1 1 1 2 2 2 2 2 2 2 2 2
## 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73%
## 2 2 2 2 3 3 3 3 3 3 3 4
## 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85%
## 4 4 4 5 5 5 6 6 7 7 8 9
## 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97%
## 10 11 13 14 17 19 23 27 33 42 55 78
## 98% 99% 100%
## 124 250 91763
##
## All Terms (frequencies):
## Number of Types: 73779
## Number of Tokens: 1881143
## Tokens / Types: 25.50
## Frequency Mean: 25.50
## Frequency Variance: 359013.40
## Frequency Standard Deviation: 599.18
## Frequency Standard Error: 2.21
##
## Frequent Terms (count >= 6):
## Number of Types: 15407
## Number of Tokens: 1785871
## Tokens / Types: 115.91
## Frequency Mean: 115.91
## Frequency Variance: 1708946.23
## Frequency Standard Deviation: 1307.27
## Frequency Standard Error: 10.53
##
## Frequent Terms Percentage: 20.88 %
## Frequent Terms Coverage: 94.94 %
##
## --------------------
##
## SAMPLE_NEWS.TXT term counts statistics:
##
## Centiles:
##
## 50% 51% 52% 53% 54% 55% 56% 57% 58%
## 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
## 59% 60% 61% 62% 63% 64% 65% 66% 67%
## 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
## 68% 69% 70% 71% 72% 73% 74% 75% 76%
## 3.0 3.0 3.0 4.0 4.0 4.0 4.0 5.0 5.0
## 77% 78% 79% 80% 81% 82% 83% 84% 85%
## 5.0 6.0 6.0 6.0 7.0 8.0 8.0 9.0 10.0
## 86% 87% 88% 89% 90% 91% 92% 93% 94%
## 11.0 12.0 14.0 16.0 18.0 21.0 25.0 30.0 37.0
## 95% 96% 97% 98% 99% 100%
## 46.0 61.0 86.0 133.0 258.8 97940.0
##
## All Terms (frequencies):
## Number of Types: 74621
## Number of Tokens: 1711106
## Tokens / Types: 22.93
## Frequency Mean: 22.93
## Frequency Variance: 295802.42
## Frequency Standard Deviation: 543.88
## Frequency Standard Error: 1.99
##
## Frequent Terms (count >= 5):
## Number of Types: 18704
## Number of Tokens: 1625034
## Tokens / Types: 86.88
## Frequency Mean: 86.88
## Frequency Variance: 1174712.74
## Frequency Standard Deviation: 1083.84
## Frequency Standard Error: 7.92
##
## Frequent Terms Percentage: 25.07 %
## Frequent Terms Coverage: 94.97 %
##
## --------------------
##
## SAMPLE_TWITTER.TXT term counts statistics:
##
## Centiles:
##
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61%
## 1 1 1 1 1 1 1 1 1 1 1 1
## 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73%
## 1 1 1 1 2 2 2 2 2 2 2 2
## 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85%
## 2 2 3 3 3 3 3 4 4 5 5 5
## 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97%
## 6 7 8 9 10 12 14 17 22 28 37 54
## 98% 99% 100%
## 89 204 48239
##
## All Terms (frequencies):
## Number of Types: 78110
## Number of Tokens: 1561577
## Tokens / Types: 19.99
## Frequency Mean: 19.99
## Frequency Variance: 181152.63
## Frequency Standard Deviation: 425.62
## Frequency Standard Error: 1.52
##
## Frequent Terms (count >= 5):
## Number of Types: 13285
## Number of Tokens: 1474410
## Tokens / Types: 110.98
## Frequency Mean: 110.98
## Frequency Variance: 1055185.37
## Frequency Standard Deviation: 1027.22
## Frequency Standard Error: 8.91
##
## Frequent Terms Percentage: 17.01 %
## Frequent Terms Coverage: 94.42 %
##
## --------------------
As you can see, these statistics show a very skewed distribution of the types:
| Coverage: | 50% | 90% | 95% |
|---|---|---|---|
| Blogs | 0.14% | 9.0% | 20.9% |
| News | 0.26% | 11.1% | 25.0% |
| Twitter | 0.16% | 7.3% | 17.0% |
| Total | 0.09% | 4.6% | 11.8% |
This means that, choosing as dictionary the set of corpus terms that appear at least 50 times (a cutoff keeping 4.6% of the terms), we get a token coverage of 90%; choosing the set of terms that appear at least 12 times, we get a token coverage of 95%.
2-grams and 3-grams behave differently: e.g., choosing even a count cutoff of 2 for 2-grams, coverage falls from 100% (cutoff = 1) to 76.48%.
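The trial-and-error coverage search can be expressed as a small function of the term counts (assuming `tdm1` is the 1-gram TDM from the sketch above):

```r
# Fraction of tokens covered by the terms appearing at least `cutoff` times
coverage <- function(counts, cutoff) sum(counts[counts >= cutoff]) / sum(counts)

counts1 <- slam::row_sums(tdm1)                    # 1-gram term counts
sapply(c(1, 2, 12, 50), function(k) coverage(counts1, k))
```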
These are the 20 most frequent terms in the US_en_sample corpus:
## Words in dictionary: 7252
## sample_blogs.txt sample_news.txt sample_twitter.txt TOTAL
## the 91763 97940 45497 235200
## to 52797 44883 38969 136649
## and 54420 43967 21554 119941
## a 44596 43449 29975 118020
## of 43351 38301 17799 99451
## i 41195 7948 35823 84966
## in 29031 33289 18448 80768
## ! 8185 496 48239 56920
## for 18308 17490 18971 54769
## unk 13366 25517 14743 53626
## is 21489 14299 17830 53618
## that 23431 17529 11614 52574
## it 21884 11344 14415 47643
## you 15729 4855 26924 47508
## on 13487 13266 13523 40276
## with 14364 12730 8564 35658
## was 13821 11131 5806 30758
## my 13771 2137 14342 30250
## at 8359 10747 9027 28133
## ? 6391 2410 19011 27812
Let’s use our 90% dictionary to make some plots by frequency rank:
The most frequent type is “the”: more than 4% of all the tokens are instances of the type “the”.
What we are seeing is the expression of Zipf’s Law: Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. […] Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
We can use the Zipf_plot() function from the tm package to explore how well this law fits our data:
## (Intercept) x
## 15.282878 -1.319597
As shown, the term frequencies fit the law quite well for terms appearing more than 4 times.
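For reference, the coefficients above can be obtained as follows (`tdm1` is assumed to be the 1-gram TDM built earlier); Zipf_plot() plots log(frequency) against log(rank) and fits a straight line, so a slope close to -1 indicates good agreement with Zipf’s law:

```r
zipf_coef <- Zipf_plot(tdm1)   # draws the plot and fits log(freq) ~ log(rank)
zipf_coef                      # (Intercept) and slope, as printed above
```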
Beyond terms frequencies, we can now explore the frequency of frequencies. Some explanation is needed here:
Let V = {w1, w2, w3, w4, w5} be our VOCABULARY: |V| = Card(V) = 5 TERMS (or TYPES).
Let T = (w1, w2, w1, w3, w3, w4, w3, w5, w1, w5) be our TEXT: N = 10 TOKENS.
The term counts and their relative frequencies follow directly from the text. For example, c(w1) = 3 and c(w1)/N = 3/10, which means that 30% of the corpus tokens are instances of the term w1. These counts are what we have been discussing in the previous section.
But now we ask how many terms appear exactly “c” times.
We can compute frequencies of frequencies and densities of frequencies:
For example, \(V_{m=1} = 2\) and \(V_{m=1}/|V| = 2/5\), which means that 40% of the vocabulary terms appear exactly once in the text.
This is also known as frequency spectrum.
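For the toy example above, the spectrum can be computed directly in R:

```r
text   <- c("w1", "w2", "w1", "w3", "w3", "w4", "w3", "w5", "w1", "w5")
counts <- table(text)                 # term counts: w1=3, w2=1, w3=3, w4=1, w5=2
spec   <- table(as.vector(counts))    # frequency spectrum: V_1=2, V_2=1, V_3=2
spec / length(counts)                 # spectrum densities: 2/5, 1/5, 2/5
```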
Here we have a 1-gram vocabulary-growth plot, produced with the Heaps_plot() function (package tm):
## (Intercept) x
## 0.4885902 0.7411750
The red line is the observed vocabulary growth, while the black one is the growth predicted by Heaps’ law: Heaps’ law is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the so-called type-token relation). It can be formulated as:
\[ V_R(n) = Kn^\beta \]
It simply means that the vocabulary size |V| grows with the document length N following a power law (sublinearly here, since \(\beta \approx 0.74 < 1\)).
Or, with zipfR package tools:
And, summing up:
The frequency spectrum shape (and Heaps’ law) can be derived from Zipf’s law. zipfR allows us to fit a finite Zipf-Mandelbrot model to the spectrum distribution (see “LNRE modelling”):
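A sketch of how such a fit can be produced with zipfR (here `tokens` is assumed to hold the 1-gram tokens of the twitter sample; the summary of the fitted model is shown below):

```r
library(zipfR)
tfl <- vec2tfl(tokens)             # type-frequency list from the token vector
spc <- tfl2spc(tfl)                # observed frequency spectrum
fzm <- lnre("fzm", spc)            # fit a finite Zipf-Mandelbrot LNRE model
summary(fzm)
fzm_spc <- lnre.spc(fzm, N(spc))   # expected spectrum at the observed sample size
plot(spc, fzm_spc, legend = c("observed", "fZM"))
```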
## finite Zipf-Mandelbrot LNRE model.
## Parameters:
## Shape: alpha = 0.6243447
## Lower cutoff: A = 3.920857e-39
## Upper cutoff: B = 0.004740673
## [ Normalization: C = 2.804611 ]
## Population size: S = 4.279724e+24
## Sampling method: Poisson, with exact calculations.
##
## Parameters estimated from sample of size N = 1561577:
## V V1 V2 V3 V4 V5
## Observed: 78110 50798.0 7979.00 3781.00 2267.00 1604.00 ...
## Expected: 78110 48846.8 9174.78 4207.11 2498.66 1686.92 ...
##
## Goodness-of-fit (multivariate chi-squared test):
## X2 df p
## 472.5227 13 9.945421e-93
(predicted values are the red bars, observed values the black ones)
Although the fit looks quite good, we should be cautious given the low p-value reported for the model.
Then, from the spectrum model we can derive a frequency Zipf-Mandelbrot model (see “LNRE modelling”):
The blue line is the observed term frequency; the red one is the predicted frequency.
As the low p-value suggested, this model does not work well. It’s a pity because, with such a model at our disposal, we could probably save a lot of RAM (no need to store frequencies; just store the ranked list of terms and compute frequencies from the model).
From the sizes of the TDMs obtained, we estimated the RAM required:
## RAM for 1-Wordgrams table: 2.718 Mbytes
## RAM for 2-Wordgrams table: 27.181 Mbytes
## RAM for 3-Wordgrams table: 54.363 Mbytes
## Total RAM needed: 84.262 Mbytes
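These figures can be obtained, for instance, by measuring the in-memory size of the n-gram frequency tables (a rough sketch; the actual numbers depend on the data structures used):

```r
# Size of a 1-gram frequency table derived from tdm1 (assumed from above)
tfl1 <- data.frame(term = Terms(tdm1), count = slam::row_sums(tdm1))
format(object.size(tfl1), units = "Mb")
```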
To get a statistically significant sample:
We need to take N samples, each of size n. Let’s say N = 100 samples and n = 1000 tokens, so that the margin of error is of the order of \(1/\sqrt{1000} \approx 3\%\).
Let’s take the twitter file as an example. From our exploratory analysis we know this file has 1,888,118 lines, and we extracted a 5% sample (94,406 lines) where we found 1,561,577 tokens, i.e., 13.23 tokens per line on average.
So, to get a sample of 1,000 tokens we need 1,000 / 13.23 = 75.6 lines on average. As we want 100 samples, we need a sample of 7,560 lines.
Of course, if we are sampling 2-grams we will need 15,120 lines; 22,680 lines for 3-grams; and 30,240 for 4-grams.
However, when we extract 1,000 tokens from 75.6 lines, each token extraction is NOT independent of the previous ones. So we should be cautious and allow for a safety margin: in each sample we will extract double the number of lines calculated for 4-grams by the procedure above (a short numeric check follows the table below).
These are the resulting numbers (% of each file we need to sample for N-grams):
| File | 1-grams | 2-grams | 3-grams | 4-grams |
|---|---|---|---|---|
| Blogs | 0.40% | 0.70% | 1.10% | 1.40% |
| News | 0.40% | 0.80% | 1.20% | 1.50% |
| Twitter | 0.50% | 0.90% | 1.30% | 1.70% |
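As a sanity check of the arithmetic above (using the line and tokens-per-line figures reported for the twitter training file; the table also includes the extra safety margin discussed earlier):

```r
n_lines_file     <- 1888118                          # twitter training file
tokens_per_line  <- 13.23
lines_per_sample <- 1000 / tokens_per_line           # ~75.6 lines per 1,000 tokens
lines_needed     <- 100 * lines_per_sample * (1:4)   # 100 samples, 1- to 4-grams
round(100 * lines_needed / n_lines_file, 2)          # % of the file, before margin
```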
The term frequency distribution lets us cover 95% of the tokens with 21%, 25%, 17% and 12% of the terms (for Blogs, News, Twitter and the total corpus, respectively). We can use reduced dictionaries for efficiency’s sake.
We must be careful with the last two conclusions, as the vocabulary (number of types) grows with the length (number of tokens) of the text. When working with samples we can choose reduced dictionaries to cover, for instance, 95% of tokens; but the same dictionaries will cover a lower percentage of tokens when working with larger samples.
RAM considerations: with 5% samples we need RAM allocations on the order of 50 to 100 Mbytes. Appropriate data structures and tools for fast table lookup must be considered (e.g., the data.table and ff packages).
During the analysis, the following resources were studied:
From the above analysis and study, the following guidelines have been defined in a work plan (main reference: Speech and Language Processing):
1- Divide corpus (en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt) into 3 disjoint and randomized sets:
2- Training
3- Predict
I see two possibilities, the first easier to implement than the second:
A - Simple back off algorithm
From the N-wordgram TDMs we can build N-wordgram count tables. They will be very sparse (many cells will be zero, i.e., unseen N-wordgrams). We can transform these N-wordgram count tables into N-wordgram probability tables.
(Perhaps it is possible to use the TDMs directly instead of building such sparse N-wordgram count tables.)
The algorithm is as follows:
Next word = sample(colnames, size = 1, prob = probabilities in the row of the (N-1)-gram formed by the last N-1 input words)
Back off: if that (N-1)-gram is not found in the N-wordgram probability table (or perhaps if its probability is very low?), repeat the process with the (N-2)-gram probability table and the last N-2 input words (and so on).
If we reach the unigram probability table and no result is found, return “Next word probably not in my training dictionary”.
If we can work just with the TDMs, as they are not too big, we could use N=4 or even N=5.
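A minimal sketch of this simple backoff idea, assuming `ngram_tables` is a list of data frames ordered from the highest-order model down to 2-grams, each with columns `prefix` (the first n-1 words, space separated), `word` and `prob`, plus an optional `unigrams` table. These names and structures are illustrative, not the exact ones used:

```r
predict_next <- function(input_words, ngram_tables, unigrams = NULL) {
  for (tab in ngram_tables) {
    # order of this table = number of words in its prefixes + 1
    n <- length(strsplit(tab$prefix[1], " ", fixed = TRUE)[[1]]) + 1
    prefix <- paste(tail(input_words, n - 1), collapse = " ")
    rows <- tab[tab$prefix == prefix, ]
    if (nrow(rows) > 0) {
      # sample the next word using the row probabilities as weights
      return(sample(rows$word, size = 1, prob = rows$prob))
    }
    # prefix not seen: back off to the next lower-order table
  }
  if (!is.null(unigrams)) {
    return(sample(unigrams$word, size = 1, prob = unigrams$prob))
  }
  "Next word probably not in my training dictionary"
}

# e.g. predict_next(c("at", "the", "end"), list(tbl4, tbl3, tbl2), tbl1)
# (tbl4 ... tbl1 are hypothetical probability tables)
```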
B - Smoothing + back off/Interpolation Kneser-Ney algorithm
See video and slides about Kneser-Ney Smoothing.
With this algorithm we end up with a smoothed N-wordgram probability table (no zeros). The probabilities for the originally unseen N-grams are obtained by discounting mass from the observed N-gram probabilities, following an elegant backoff/interpolation strategy.
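For reference, the interpolated Kneser-Ney estimate in the bigram case (as presented in the referenced video and slides) is:

\[ P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1} w_i) - d,\, 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i) \]

where \(P_{cont}(w_i)\) is the number of distinct bigram types ending in \(w_i\) divided by the total number of distinct bigram types, and \(\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\lvert\{w : c(w_{i-1} w) > 0\}\rvert\) redistributes the discounted mass.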
Difficulties can arise from the fact that this N-wordgram probability table could be really huge. A RAM estimation is needed to choose N.
In this case we just have to sample one word from the row of the (N-1)-gram formed by the last N-1 input words in the N-wordgram probability table, taking the probabilities in this row as weights:
Next word = sample(colnames, size = 1, prob = probabilities in the row of that (N-1)-gram)
Only if no result is found should we return “Next word probably not in my training dictionary”.
4- Test and Evaluate model
Accuracy and perplexity measures of the N-gram models trained on the training set will be computed on the test set.
This cycle will first be developed completely with the simpler prediction algorithm, in order to build a Shiny application as soon as possible. If time permits, the second prediction algorithm will be implemented and compared with the first one.
This is the current state of development of the algorithm:
Training and testing sample files have been created for each of the blogs, news and twitter data files: training files = 80% and testing files = 20% of each data file.
Training. For each training file a sample has been created.
Determining dictionaries. Each sample has been tokenized, after processing punctuation, in order to obtain 1-gram TDMs. Statistics have been calculated to determine how many counts are necessary to retain a token in a dictionary (criterion: 95% token coverage). Based on that, dictionaries have been created.
After that, all tokens not in the dictionary have been replaced in the whole corpus by the label \(<UNK>\).
Next, new TDMs have been calculated, now for 1-, 2-, 3- and 4-grams. In the case of 2-, 3- and 4-grams, pseudo-words have been added at the beginning and at the end of each line.
From these TDMs, Term Frequency Lists (TFLs) have been derived and stored as data.tables. We now have a model (4 TFLs each: 1-, 2-, 3- and 4-grams) for the Blogs, News and Twitter documents and for the whole corpus. The Blogs, News and Twitter models use about 40 Mbytes of data tables each, and the Corpus model about 70 Mbytes.
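A sketch of why data.table is attractive here: keying the TFLs by prefix makes the lookups in the prediction step very fast (the column names below are illustrative, not necessarily those of the stored TFLs):

```r
library(data.table)
tfl4 <- data.table(prefix = c("at the end", "at the end"),
                   word   = c("of", "i"),
                   count  = c(120, 3))
setkey(tfl4, prefix)     # binary-search index on the prefix column
tfl4["at the end"]       # fast retrieval of all candidate next words
```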
A prediction function has been implemented, following a simple backoff algorithm beginning with 4-grams.
Testing. We have built 4-gram TFLs (for Blogs, News, Twitter and the whole corpus) from the test set. The prediction function, with the appropriate model, is fed the first three tokens of each testing 4-gram, and the prediction is compared with the fourth token. Accuracy will be measured by carrying out 100 runs of 1000 predictions each, to calculate an average and a confidence interval. The models and prediction function are currently being tested.
Evaluation. The accuracies of the four models (Blogs, News, Twitter and Corpus) will be evaluated against their respective test sets in the two modes explained above (with and without the final weighted sampling).
Further work. After implementing these models in a Shiny application, a new model (Kneser-Ney algorithm) will be implemented, evaluated and compared with the current models before adding it to the Shiny application.