This report documents the student's progress in exploring a corpus of four languages using Natural Language Processing (NLP). The data are loaded, cleaned, and an exploratory statistical analysis is applied; for some calculations, sampling is used to obtain results faster. A technical report including the reproducible code is available on GitHub.
We examine the frequency of words and of two- and three-word phrases, collectively called n-grams, and visualize the most frequent n-grams of the English corpus. The main observation at this point is that in English, 2,000-3,000 words are enough to cover 90% of the corpus. This will allow us to use a smaller set of words for our n-gram matrix and improve the speed of the prediction.
The dataset is obtained from the Amazon S3 URL provided in the instructions, a mirror of http://www.corpora.heliohost.org, which hosts the original corpus maintained by Hans Christensen.
The dataset contains news, blogs and tweets in four languages: English, German, Russian and Finnish.
| dataset | words | characters | lines |
|---|---|---|---|
| de_DE.blogs.txt | 12,653,185 | 85,459,666 | 371,440 |
| de_DE.news.txt | 13,219,388 | 95,591,959 | 244,743 |
| de_DE.twitter.txt | 11,803,735 | 75,578,341 | 947,774 |
| en_US.blogs.txt | 37,334,690 | 210,160,014 | 899,288 |
| en_US.news.txt | 34,372,720 | 205,811,889 | 1,010,242 |
| en_US.twitter.txt | 30,374,206 | 167,105,338 | 2,360,148 |
| fi_FI.blogs.txt | 12,732,013 | 108,503,595 | 439,785 |
| fi_FI.news.txt | 10,446,725 | 94,234,350 | 485,758 |
| fi_FI.twitter.txt | 3,153,003 | 25,331,142 | 285,214 |
| ru_RU.blogs.txt | 9,691,167 | 116,855,835 | 337,100 |
| ru_RU.news.txt | 9,416,099 | 118,996,424 | 196,360 |
| ru_RU.twitter.txt | 9,542,485 | 105,182,346 | 881,414 |
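The per-file figures above can be reproduced with a short helper. The sketch below uses only base R and splits words on whitespace, so the counts may differ slightly from other tools; the file path is just an example.

count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(words      = sum(vapply(strsplit(lines, "\\s+"), length, integer(1))),
    characters = sum(nchar(lines)),
    lines      = length(lines))
}

count_file("final/en_US/en_US.blogs.txt")   # example path, adjust as needed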
Apart from the word count statistics, we can also extract other useful information; for example, the English corpus has the following statistics:
The NLP pipeline involves a series of steps (loading, cleaning, tokenization, analysis and prediction), which this report partially follows.
The data are loaded in R into the Corpus data structure provided by the text mining framework library tm, which holds the corpus in memory.
A data frame is not a good data type for loading the text, because it is prone to dimensionality problems; a Corpus is based on lists.
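As an illustration (the directory path final/en_US is an assumption), the English files could be loaded into an in-memory tm corpus like this:

library(tm)

# Read every file in the directory into a volatile (in-memory) corpus
docs <- VCorpus(DirSource("final/en_US", encoding = "UTF-8"),
                readerControl = list(language = "en"))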
The cleaning of the datasets has been done as follows:
Although stop words are usually removed from a dataset, this step has been skipped: we are building a predictive model on text and do not want to miss these words.
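The exact transformations are part of the reproducible code; as a rough sketch, a typical tm cleaning pass that keeps the stop words could look like this:

# Typical tm cleaning steps (illustrative); stop words are deliberately kept
docs <- tm_map(docs, content_transformer(tolower))  # lower-case everything
docs <- tm_map(docs, removePunctuation)             # drop punctuation
docs <- tm_map(docs, removeNumbers)                 # drop digits
docs <- tm_map(docs, stripWhitespace)               # collapse repeated spaces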
Instead of importing the whole files into our dataset via the Corpus function, a sample of the data has been used; 100,000 lines per media type (twitter, blogs, news) is enough to safely conclude on the statistics of the English language.
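A minimal sampling sketch, assuming the raw files are read with readLines and 100,000 lines are drawn per media type before the corpus is built (paths and seed are illustrative):

set.seed(1234)  # reproducible sample

sample_lines <- function(path, n = 100000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

sampled <- c(sample_lines("final/en_US/en_US.blogs.txt"),
             sample_lines("final/en_US/en_US.news.txt"),
             sample_lines("final/en_US/en_US.twitter.txt"))

docs <- VCorpus(VectorSource(sampled))  # the cleaning steps above are then applied to this corpus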
Multiple R packages provide tokenization functions.
Using the NGramTokenizer function of the RWeka package, we compute the unigrams, bigrams, trigrams and quadrigrams of our English corpus, which will be used for our prediction model.
library(tm)     # Corpus handling and TermDocumentMatrix
library(RWeka)  # NGramTokenizer and Weka_control

options(mc.cores=1)  # run the tokenizers single-threaded

# One tokenizer per n-gram order
UnigramTokenizer    <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
BigramTokenizer     <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer    <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
QuadrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))

# One term-document matrix per n-gram order; rowSums gives the total count per term
utdm <- TermDocumentMatrix(docs, control = list(tokenize = UnigramTokenizer))
uni  <- rowSums(as.matrix(utdm))
btdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))
bi   <- rowSums(as.matrix(btdm))
ttdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))
tri  <- rowSums(as.matrix(ttdm))
qtdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadrigramTokenizer))
qua  <- rowSums(as.matrix(qtdm))

# Plot the ten most frequent terms of each n-gram order side by side
par(mfrow=c(1,4))
barplot(tail(sort(uni), 10), las = 2, main = "Top 10 Unigrams",    cex.main = 1, horiz = TRUE)
barplot(tail(sort(bi),  10), las = 2, main = "Top 10 Bigrams",     cex.main = 1, horiz = TRUE)
barplot(tail(sort(tri), 10), las = 2, main = "Top 10 Trigrams",    cex.main = 1, horiz = TRUE)
barplot(tail(sort(qua), 10), las = 2, main = "Top 10 Quadrigrams", cex.main = 1, horiz = TRUE)
The tokenization of the corpus into n-grams results in a matrix of terms and their appearance counts.
This matrix is called a Term Document Matrix (TDM) and is the main data structure we are going to use for the predictions.
The following example shows how the TDM is built and what its content looks like.
QuadrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
qtdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadrigramTokenizer))
qua  <- rowSums(as.matrix(qtdm))
tail(sort(qua))   # the six most frequent quadrigrams
## one of the most at the same time for the first time
## 421 436 545
## the rest of the at the end of the end of the
## 595 659 728
This matrix could be replaced by an Elasticsearch index, which natively supports n-gram tokenization and offers the scalability that we cannot reach with a single compute device.
In the provided exercises, the frequencies of the quadrigrams or trigrams matching the end of a given phrase were evaluated in order to select the best answer:
quad<-data.frame(sort(rowSums(as.matrix(qtdm)),decreasing=TRUE))
quad['would mean the world',]
## [1] 8
quad['would mean the most',]
## [1] NA
quad['would mean the universe',]
## [1] NA
quad['would mean the best',]
## [1] NA
In the example above, we observe that "would mean the world" has the most occurrences and therefore the highest probability of supplying the next word.
In a similar way, we expect to use the n-grams to predict the next word in a phrase, by matching the stored phrases with the highest probability of appearance.
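A minimal sketch of that idea, using the qua and tri count vectors computed above; this is only an illustration of the lookup, not the final prediction algorithm.

# Given the last three words typed, return the last word of the most frequent
# quadrigram that starts with them, backing off to the trigrams when needed.
predict_next <- function(prefix, quad_counts, tri_counts) {
  hits <- quad_counts[startsWith(names(quad_counts), paste0(prefix, " "))]
  if (length(hits) == 0) {
    shorter <- sub("^\\S+\\s+", "", prefix)   # drop the first word and back off
    hits <- tri_counts[startsWith(names(tri_counts), paste0(shorter, " "))]
  }
  if (length(hits) == 0) return(NA_character_)
  best <- names(sort(hits, decreasing = TRUE))[1]
  tail(strsplit(best, " ")[[1]], 1)
}

predict_next("would mean the", qua, tri)   # should return "world", matching the example above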
The sentences that contain a swear word have been completely removed because:
This was done using a Google list of swear words that is available through the dwyl.com website; @jamiew compiled the list at https://gist.github.com/jamiew/1112488.
By analysing the TDM further, we observe that the frequency of each unigram (word) is inversely proportional to its rank in the frequency table. This is known as [Zipf's Law](http://en.wikipedia.org/wiki/Zipf's_law) and it is seen in all languages and media types.
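A quick way to check this on the unigram counts computed earlier is to plot frequency against rank on log-log axes; an approximately straight line with slope near -1 is the usual Zipf signature.

freq <- sort(uni, decreasing = TRUE)   # unigram counts, most frequent first
plot(log10(seq_along(freq)), log10(freq), type = "l",
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Zipf's law on the sampled English corpus")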
If we sort the TDM by word frequency, the following numbers of words are needed to cover half of the word instances per language/media type:
| language | blogs | news | twitter |
|---|---|---|---|
| en_US | 195 | 116 | 247 |
| ru_RU | 576 | 594 | 711 |
| fi_FI | 1661 | 1665 | 2711 |
| de_DE | 151 | 88 | 143 |
To cover 90% of the word instances, we will need:
| language | blogs | news | twitter |
|---|---|---|---|
| en_US | 4041 | 1876 | 3496 |
| ru_RU | 18508 | 17489 | 16860 |
| fi_FI | 25092 | 42404 | 41674 |
| de_DE | 6771 | 3238 | 5773 |
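These coverage figures can be derived from the sorted unigram counts; the sketch below shows the computation for the sampled English corpus, using the uni vector from the tokenization step above.

# Number of most-frequent words needed to cover a given share of all
# word instances, based on the unigram counts
coverage <- function(counts, share) {
  cum <- cumsum(sort(counts, decreasing = TRUE)) / sum(counts)
  which(cum >= share)[1]
}

coverage(uni, 0.5)   # words needed for 50% coverage
coverage(uni, 0.9)   # words needed for 90% coverage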
As explained, the TDM is the data structure which NLP uses to store the index of a language corpus. That index is used, as we have seen in the example, to predict the next word following a given phrase.
Two main problems arise and will be part of the next weeks' studies/work.
The calculation time needed to index a full corpus is huge, which is not acceptable when the indexing has to be repeated in order to try multiple algorithms.
There are two workarounds to solve this problem:
Elasticsearch indexing seems to be the best option, given that the outcome is expected to work through a web API behind a web interface, rather than on a mobile device as SwiftKey does.
This will allow us to use a full quadrigram index as the first option for matching the user input, falling back to the trigram index, as there will be no memory limitations.