This report documents the student's progress in exploring a corpus in four languages using Natural Language Processing (NLP). The data are loaded, cleaned, and subjected to an exploratory statistical analysis. For some calculations, sampling is used to obtain results faster. A technical report including the reproducible code is available on GitHub.

We examine the frequency of appearance of words and of two/three-word phrases, called n-grams. The most frequent n-grams of the English corpus are visualized. The main observation at this point is that, in English, 2,000-3,000 words are enough to cover 90% of the corpus. This will allow us to use a smaller set of words for our n-gram matrix, improving the speed of the prediction.

Data statistics and methodology

The dataset is obtained from the Amazon S3 URL provided in the instructions, a mirror of http://www.corpora.heliohost.org, the original source of the corpus, which is maintained by Hans Christensen.

The dataset contains news, blogs and tweets in four languages: English, German, Russian and Finnish.

dataset             words        characters    lines
de_DE.blogs.txt     12,653,185    85,459,666    371,440
de_DE.news.txt      13,219,388    95,591,959    244,743
de_DE.twitter.txt   11,803,735    75,578,341    947,774
en_US.blogs.txt     37,334,690   210,160,014    899,288
en_US.news.txt      34,372,720   205,811,889  1,010,242
en_US.twitter.txt   30,374,206   167,105,338  2,360,148
fi_FI.blogs.txt     12,732,013   108,503,595    439,785
fi_FI.news.txt      10,446,725    94,234,350    485,758
fi_FI.twitter.txt    3,153,003    25,331,142    285,214
ru_RU.blogs.txt      9,691,167   116,855,835    337,100
ru_RU.news.txt       9,416,099   118,996,424    196,360
ru_RU.twitter.txt    9,542,485   105,182,346    881,414

Apart from the word count statistics, we can also extract other useful information from each corpus, such as additional summary statistics for the English files.

A typical NLP pipeline involves the steps described below, which this report partially follows.

Data Exploration

Acquisition, cleaning, sampling

The data are loaded in R into a Corpus data structure, provided by the text mining framework tm, which holds the corpus in memory.

A data frame is not a good data type for loading the text, because it is prone to dimensionality problems; a Corpus is list-based.
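
As a rough sketch of that loading step (the file path below is an assumption based on the dataset layout, not taken from the report's code):

library(tm)

# Assumed path to one of the raw files; every line becomes one document.
blog_lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
docs <- VCorpus(VectorSource(blog_lines))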

The cleaning of the datasets has been done as follows:

  • Multi-sentence paragraphs (blog posts, news articles) have been broken into separate entries in our dataset.
  • All capital letters have been transformed to lower case (e.g. “This” becomes “this”), so that our algorithm counts all appearances under one key.
  • All numbers have been removed, since they are not useful for the prediction; this also keeps our dataset small.
  • Punctuation has been removed, so that words are counted under one key (e.g. “this,” is counted as “this”).
  • Whitespace has been stripped (e.g. " this " is counted as “this”).

Although stop words are usually removed from a dataset, this has not been done here: we are building a predictive model for text, and we do not want to lose these words.
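
A minimal sketch of these transformations using tm_map on the corpus loaded above (the sentence splitting of the first step is assumed to happen on the raw lines before the corpus is built):

docs <- tm_map(docs, content_transformer(tolower))  # "This"  -> "this"
docs <- tm_map(docs, removeNumbers)                 # drop all digits
docs <- tm_map(docs, removePunctuation)             # "this," -> "this"
docs <- tm_map(docs, stripWhitespace)               # collapse extra whitespace
# Stop words are intentionally kept, so there is no removeWords(stopwords()) step.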

Instead of importing the entire files into our dataset via the Corpus function, a sample of the data has been used. 100,000 lines per media type (twitter, blogs, news) are enough to draw safe conclusions about the statistics of the English language.
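
A sketch of that sampling step (the seed value is an assumption added for reproducibility):

set.seed(1234)                              # assumed seed, for reproducibility
blog_sample <- sample(blog_lines, 100000)   # 100,000 lines for this media type
docs <- VCorpus(VectorSource(blog_sample))  # repeat for news and twitter, then clean as above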

Tokenization

Multiple functions provide tokenization in R; a short comparison of their output is sketched after the list below.

  • scan_tokenizer() splits the text of the corpus into a character vector, using whitespace as the delimiter; anything between spaces is considered a word.
  • MC_tokenizer() splits the text of the corpus into a character vector, ignoring punctuation, parentheses, numbers, etc.
  • NGramTokenizer() splits a string into n-grams, so we get not only the unigrams (words) that the two tokenizers above produce, but also bigrams, trigrams, etc. (two/three-word phrases).
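
The comparison below runs the three tokenizers on a made-up sentence (the sentence is purely illustrative):

library(tm)     # scan_tokenizer(), MC_tokenizer()
library(RWeka)  # NGramTokenizer(), Weka_control()

s <- "This is a test, isn't it?"                   # illustrative sentence
scan_tokenizer(s)                                  # whitespace-delimited tokens
MC_tokenizer(s)                                    # punctuation, numbers etc. ignored
NGramTokenizer(s, Weka_control(min = 2, max = 2))  # bigrams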

Using the NGramTokenizer method, we compute the unigrams, bigrams, trigrams and quadrigrams of our English corpus, which will be used for our prediction model.

library(tm)     # TermDocumentMatrix()
library(RWeka)  # NGramTokenizer(), Weka_control()

par(mfrow=c(1,4))  # four barplots side by side

# Unigrams (single words)
ngram <- 1
options(mc.cores=1)  # use a single core for tokenization
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=ngram, max=ngram))
utdm <- TermDocumentMatrix(docs, control = list(tokenize = UnigramTokenizer))
uni <- rowSums(as.matrix(utdm))  # total count of each unigram across documents

# Bigrams (two-word phrases)
ngram <- 2
options(mc.cores=1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=ngram, max=ngram))
btdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))
bi <- rowSums(as.matrix(btdm))

# Trigrams (three-word phrases)
ngram <- 3
options(mc.cores=1)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=ngram, max=ngram))
ttdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))
tri <- rowSums(as.matrix(ttdm))

# Quadrigrams (four-word phrases)
ngram <- 4
options(mc.cores=1)
QuadrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=ngram, max=ngram))
qtdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadrigramTokenizer))
qua <- rowSums(as.matrix(qtdm))

# Top 10 most frequent n-grams of each order
barplot(tail(sort(uni), 10), las = 2, main = "Top 10 Unigrams",    cex.main = 1, horiz = TRUE)
barplot(tail(sort(bi),  10), las = 2, main = "Top 10 Bigrams",     cex.main = 1, horiz = TRUE)
barplot(tail(sort(tri), 10), las = 2, main = "Top 10 Trigrams",    cex.main = 1, horiz = TRUE)
barplot(tail(sort(qua), 10), las = 2, main = "Top 10 Quadrigrams", cex.main = 1, horiz = TRUE)

Figure: top 10 unigrams, bigrams, trigrams and quadrigrams of the English sample (horizontal barplots).

The tokenization of the corpus into n-grams results in a matrix of terms and their appearance counts.

This matrix is called a Term Document Matrix (TDM), and it is the main data structure we will use for the predictions.

The following example shows how the TDM is built and what its content looks like.

QuadrigramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=4,max=4))
qtdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadrigramTokenizer))
qua<-rowSums(as.matrix(qtdm))
tail(sort(rowSums(as.matrix(qtdm))))
##    one of the most   at the same time for the first time 
##                421                436                545 
##    the rest of the      at the end of     the end of the 
##                595                659                728

This matrix could be replaced by an Elasticsearch database, which natively supports n-gram tokenization and offers scalability that we cannot reach on a single compute node.

Next word prediction exercise

In the provided exercises, the frequencies of the quadrigrams (or trigrams) ending a given phrase have been evaluated in order to select the best answer:

quad<-data.frame(sort(rowSums(as.matrix(qtdm)),decreasing=TRUE))
quad['would mean the world',]
## [1] 8
quad['would mean the most',]
## [1] NA
quad['would mean the universe',]
## [1] NA
quad['would mean the best',]
## [1] NA

In the example above, we observe that “would mean the world” has the most occurrences and therefore the highest probability of being the correct continuation.

In a similar way, we expect to use the n-grams to predict the next word in a phrase, by matching the stored phrases with the highest probability of appearance.
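
As a rough illustration of this idea, the sketch below does a prefix match against the n-gram count vectors computed earlier; the helper function and the example phrases are hypothetical, not the final prediction model.

# Hypothetical helper: return the most likely next word for a phrase, given a
# named vector of n-gram counts (e.g. `qua` for quadrigrams, `tri` for trigrams).
predict_next <- function(phrase, counts) {
  hits <- counts[grepl(paste0("^", phrase, " "), names(counts))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(which.max(hits))        # most frequent n-gram starting with the phrase
  tail(strsplit(best, " ")[[1]], 1)     # its last word is the prediction
}

predict_next("at the end", qua)  # three-word prefix matched against quadrigrams
predict_next("the end", tri)     # two-word prefix matched against trigrams (backoff)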

Profanity Filtering

Sentences that contain a swear word have been removed entirely, because:

  • Removing only the swear word leaves a sentence that does not help the prediction
  • The number of entries in the corpus that include swear words is relatively small

This was done using a Google list of swear words, available through the dwyl.com website; @jamiew compiled the list at https://gist.github.com/jamiew/1112488.
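
A sketch of that filtering step, assuming the gist has been saved locally as bad-words.txt and applied to the sampled lines from above (file name and variable are assumptions):

bad_words <- readLines("bad-words.txt", encoding = "UTF-8")  # assumed local copy of the gist

# Build one whole-word regex from the list and drop every line that matches it.
pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
blog_sample <- blog_sample[!grepl(pattern, blog_sample, ignore.case = TRUE)]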

Distribution of word frequencies

Analysing the TDM further, we observe that the frequency of each unigram (word) is inversely proportional to its rank in the frequency table. This is known as [Zipf’s Law](http://en.wikipedia.org/wiki/Zipf's_law) and is seen in all languages and media types.
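
One way such a distribution can be visualized from the unigram counts computed earlier (a sketch, not necessarily the exact code behind the figure below):

freq <- sort(uni, decreasing = TRUE)  # unigram counts, highest first
plot(seq_along(freq), freq, log = "xy", type = "l",
     xlab = "rank", ylab = "frequency", main = "Unigram frequency vs. rank")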

Figure: unigram frequency versus rank, illustrating Zipf’s law.

If we sort the TDM by word frequency, the following numbers of words are needed to cover half of the word instances per language/media type:

language   twitter   blogs   news
en_US          195     116    247
ru_RU          576     594    711
fi_FI         1661    1665   2711
de_DE          151      88    143

To cover 90% of the word instances, we need:

language   twitter   blogs    news
en_US         4041    1876    3496
ru_RU        18508   17489   16860
fi_FI        25092   42404   41674
de_DE         6771    3238    5773
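
Counts like these can be reproduced from the sorted unigram frequencies of each language/media type; a minimal sketch:

# Number of top-ranked words needed to cover a given share of all word instances.
coverage_words <- function(counts, coverage) {
  freq <- sort(counts, decreasing = TRUE)
  cum  <- cumsum(freq) / sum(freq)
  which(cum >= coverage)[1]
}
coverage_words(uni, 0.5)  # words covering 50% of the instances
coverage_words(uni, 0.9)  # words covering 90% of the instances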

Next steps

As explained, the TDM is the data structure in which NLP stores the index of a language corpus. That index is used, as we have seen in the example, to predict the next word following a given phrase.

Two main problems arise and will be part of the next weeks’ study and work.

Data engineering problem

The time needed to index the full corpus is huge, which is not acceptable when the indexing process has to be repeated to try multiple algorithms.

There are two workarounds to solve this problem:

  • Scale out the computation by using a Solr/Elasticsearch cluster, which natively supports n-gram tokenization, or
  • Keep using a sample of the corpus instead of the entire corpus

Elasticsearch indexing seems to be the best option, given that the outcome is expected to work through a web API behind a web interface rather than on a mobile device, as SwiftKey does.

This will allow us to use a full quadrigram index as the first option to match the user input, before falling back to the trigram index, as there will be no memory limitations.

NLP/data scientist problems