Natural language processing (NLP) is the science of interaction between computers and human language. One instance of NLP in daily life is the SwiftKey keyboard, which helps people type faster on their smartphones. In this report, we leverage Ngram prediction to do a similar job, using the linguistic corpus from HC Corpora.

The Ngram we use is a contiguous sequence of n items (tokens) from a given sequence of text or speech. It is called a unigram for a single item, a bigram for two, a trigram for three, and so forth.
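
A toy illustration of the idea (the helper function and the example sentence below are ours, not part of the corpus):

ngrams <- function(tokens, n) {
    ## slide a window of length n over the token vector
    sapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
tokens <- strsplit("to be or not to be", " ")[[1]]
ngrams(tokens, 2)
[1] "to be"  "be or"  "or not" "not to" "to be"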

Load the data

The data used includes 4 languages: de, en, fi, ru. For each language, there are 3 text files: one from blogs, one from news, and one from Twitter.

Before loading these files, a basic summary can be retrieved to estimate the workload beforehand. We can open a bash shell (or Cygwin on Windows) and execute wc -l * to count the total lines, wc -L * to get the maximal line length, and wc -w * to count the total words; an R equivalent is sketched after the table.

File                 Total Lines   Total Words   Maximal Characters per Line
en_US.blogs.txt           899288        3.72e7                         40833
en_US.news.txt           1010242        3.42e7                         11384
en_US.twitter.txt        2360148        3.04e7                           173
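
For reference, a rough R equivalent of this summary for a single file (a sketch only; it is much slower than wc and assumes the file sits in the working directory):

f <- "en_US.twitter.txt"
lines <- readLines(f, skipNul = TRUE)
c(total_lines = length(lines),
  total_words = sum(lengths(strsplit(lines, "\\s+"))),
  max_chars   = max(nchar(lines)))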

We can use the code below to read the contents of these files. In some environments, the encoding setup is important: one can use the encoding = "UTF-8" argument of readLines, or later set Encoding() <- "UTF-8" to change the declared encoding of a particular region; the translate() function in the tau package can do the same thing. We found that setting an encoding is a little tricky, and unknown is probably the best choice.

workdir <- "A WORKING DIR CONTAINING THE 3 TEXT FILES"
fList <- list.files(workdir)
corpora <- vector("list", length(fList))
for (i in seq_along(fList)) {
    ## read each file line by line; skipNul avoids warnings on embedded NULs
    corpora[[i]] <- readLines(file.path(workdir, fList[i]), skipNul = TRUE)
}
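
If an explicit encoding does turn out to be necessary, the two base-R options mentioned above look like this (a minimal sketch; the file name is only an example):

## force UTF-8 at read time ...
txt <- readLines(file.path(workdir, "en_US.blogs.txt"),
                 encoding = "UTF-8", skipNul = TRUE)
## ... or mark the encoding on the result afterwards
Encoding(txt) <- "UTF-8"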

After we load the data, we can apply random sampling to constrain the raw data to 1e4 lines per file. This step speeds up code execution and reduces memory usage during the exploratory study, but it may not be a good option for building the Ngram prediction. Later on, we will show how we manage to do Ngram prediction based on the huge original raw data.
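
A minimal sketch of that sampling step, assuming txt holds the raw lines of one file:

set.seed(1234)                        # make the subsample reproducible
txtSample <- sample(txt, size = 1e4)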

After we get the smaller chunk of data, we can start tokenization.

Tokenization

Tokenization basically means segmenting the original text into words, plus some post-processing, in order to prepare a data set for further analysis.

A common practice of tokenization could, but does not necessarily, include the following (a naive sketch follows the list):

  1. separating the text into sentences.
  2. segmenting the text into words, where the word boundary could be white space or leading and trailing quotation marks.
  3. handling abbreviations.
  4. handling hyphenated words.
  5. etc.
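
As a naive illustration of steps 1 and 2 (the regular expressions below are simplistic placeholders, not what openNLP or tm do internally):

text <- "Dr. Smith moved to St. Louis. He likes old-fashioned diners."
## naive sentence split after ., ! or ? -- note it wrongly breaks on "Dr." and "St."
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
## naive word split on white space, keeping hyphenated words together
words <- unlist(strsplit(sentences, "\\s+"))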

Tokenization by openNLP

A well-designed tokenizer can be very complex. One option is the openNLP package, using a cascaded annotation like the following:

library(NLP)
library(openNLP)
## load the raw blog lines and take a reproducible 1e3-line sample
corpusData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", n = -1, skipNul = T)
set.seed(150)
corpusData <- sample(x = corpusData, size = 1e3, replace = F)

## collapse the sample into a single String, annotate sentences first,
## then feed the sentence annotations to the word annotator
s <- as.String(corpusData)
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- annotate(s, sent_token_annotator)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, word_token_annotator, a1)
## keep only the word-level annotations
wa <- a2[sapply(a2, function(ak){ak$type != "sentence"})]
print(corpusData[30])
[1] "eBlockwatch is one of the largest crime-fighting community networks in the country. It has 60 458 members, according to its founder, Andre Snyman."
nStart <- which(s[wa] == "eBlockwatch")
print(s[wa[nStart:(nStart+50)]])
 [1] "eBlockwatch"      "is"               "one"             
 [4] "of"               "the"              "largest"         
 [7] "crime-fighting"   "community"        "networks"        
[10] "in"               "the"              "country"         
[13] "."                "It"               "has"             
[16] "60"               "458"              "members"         
[19] ","                "according"        "to"              
[22] "its"              "founder"          ","               
[25] "Andre"            "Snyman"           "."               
[28] "Nazim"            "has"              "refused"         
[31] "to"               "comment"          "on"              
[34] "the"              "matter"           "."               
[37] "Since"            "my"               "all-time"        
[40] "favourite"        "Melbourne-filmed" "TV"              
[43] "show"             "RUSH"             "was"             
[46] "sadly"            "AXED"             "last"            
[49] "year"             ","                "I"               

Tokenization by tm

Considering that the Ngram patterns (at least uni-, bi-, and tri-grams) are our ultimate goal for building a prediction, we need the distribution of these Ngrams after tokenization.

The tm package provides a different way to do tokenization, based on its corpus data structure. After that we can calculate the term-document matrix directly.

library(tm)
options(mc.cores=1)  # stick to one core; a common workaround when using RWeka tokenizers with tm
library(RWeka)
corpusData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", n = -1, skipNul = T)
set.seed(150)
corpusData <- sample(x = corpusData, size = 1e3, replace = F)

enCorpus <- Corpus(x=VectorSource(corpusData))
print(enCorpus[[30]]$content)
[1] "eBlockwatch is one of the largest crime-fighting community networks in the country. It has 60 458 members, according to its founder, Andre Snyman."
enCorpus <- tm_map(enCorpus, removePunctuation)
print(enCorpus[[30]]$content)
[1] "eBlockwatch is one of the largest crimefighting community networks in the country It has 60 458 members according to its founder Andre Snyman"
enCorpus <- tm_map(enCorpus, removeNumbers)
print(enCorpus[[30]]$content)
[1] "eBlockwatch is one of the largest crimefighting community networks in the country It has   members according to its founder Andre Snyman"
enCorpus <- tm_map(enCorpus, content_transformer(tolower))
print(enCorpus[[30]]$content)
[1] "eblockwatch is one of the largest crimefighting community networks in the country it has   members according to its founder andre snyman"
enCorpus <- tm_map(enCorpus, stripWhitespace)
print(enCorpus[[30]]$content)
[1] "eblockwatch is one of the largest crimefighting community networks in the country it has members according to its founder andre snyman"

One can see that the tokenization results produced by the two demos differ slightly; for instance, the openNLP demo keeps punctuation as separate tokens, whereas the tm pipeline removes it. openNLP might give better results in some extreme cases, but tm is much more convenient for Ngram generation.

Term-Document matrix and Ngram distribution analysis

The term-document matrix is built directly on the tokenization results in the tm package. In the previous demo code, corpusData has not been collapsed, so we would naturally have 1K documents (columns) in the generated matrix.

Since we might be interested in the Ngram patterns of the different corpus sources (blogs, news, twitter), we collapse the blog corpus and regenerate the term-document matrix (tdm) on it.

TokenizerN1 <- function(x) {
    ## unigram tokenizer for tm, backed by RWeka's NGramTokenizer
    NGramTokenizer(x$content, Weka_control(min=1, max=1))
}

enControl <- list(tokenize=TokenizerN1)
dtmN1 <- TermDocumentMatrix(x = enCorpus, control = enControl)

d <- as.matrix(dtmN1)
d2 <- data.frame(Term=as.character(rownames(d)),Freq=d[,1])
rownames(d2) <- as.character(1:nrow(d2))
d3 <- d2[order(d2$Freq,decreasing = T),]
print(d3[1:5,])
     Term Freq
6991  the 2079
252   and 1188
6988 that  456
2698  for  382
7547  was  342

Similarly, we can generate the bigram and trigram distribution plots.
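
A sketch of the corresponding bigram and trigram tokenizers, mirroring TokenizerN1 above (the barplot call is just one way to draw the distribution):

TokenizerN2 <- function(x) {
    NGramTokenizer(x$content, Weka_control(min=2, max=2))
}
TokenizerN3 <- function(x) {
    NGramTokenizer(x$content, Weka_control(min=3, max=3))
}
dtmN2 <- TermDocumentMatrix(x = enCorpus, control = list(tokenize=TokenizerN2))
dtmN3 <- TermDocumentMatrix(x = enCorpus, control = list(tokenize=TokenizerN3))
## e.g. the 10 most frequent bigrams
freqN2 <- sort(rowSums(as.matrix(dtmN2)), decreasing = TRUE)
barplot(freqN2[1:10], las = 2)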

Use the Ngram to do simple prediction

The method for generating Ngrams in the previous section can be used to explore the distribution of tokens, but there is still a gap before it can be adopted in a real prediction job. The challenges are:

  • We used sampling to restrain the size of the dataset, which introduces some arbitrary bias into the distribution analysis.

  • The Ngram analysis only shows the most popular tokens (uni-, bi-, tri-grams) in the corpus, while the recommendation of the next word has to depend on the context.

These two challenges indicate that the previous Ngram analysis based on sampling is of little use for our prediction problem.

In our problem, we are usually given a sentence, for instance: The guy in front of me just bought a pound of bacon, a bouquet, and a case of. In this case, we are mostly interested in the sentences in the whole corpus that contain the phrase a case of. So our first step is to extract those sentences from the corpus to generate our target set for querying. This step also decreases the total dataset size while being more purpose-driven, not to mention that it introduces no sampling bias.
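
A sketch of this extraction step (grep based; the phrase and file are just the example above):

phrase <- "a case of"
corpusLines <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", skipNul = TRUE)
## keep only the lines that contain the phrase, ignoring case
targetSet <- grep(phrase, corpusLines, ignore.case = TRUE, value = TRUE)
length(targetSet)   # much smaller than the full corpus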

The second step is tokenization and Ngram (uni-, bi-, tri-gram) analysis as we did in previous sections.

The third step is to decide on the next word; currently this is done by the following steps (a lookup sketch follows the list):

  1. We search for our term of interest (e.g. a case of) in our four-gram pool. If we can find a four-gram with a remarkably high frequency that has one extra word after our term, we recommend that word.

  2. If step 1 does not yield a champion, we search a shorter version of our term (e.g. case of) in our tri-gram pool. If we can find a phrase with a remarkably high frequency and one extra word after, we take it.

  3. If we still cannot find a champion, we shorten the term further and repeat.
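
A minimal sketch of this backoff lookup, assuming ngramTabs is a list in which ngramTabs[[n]] is a data frame of (n+1)-grams with columns Term and Freq, as generated in the previous section (all names here are illustrative):

predictNext <- function(query, ngramTabs) {
    words <- unlist(strsplit(tolower(query), "\\s+"))
    for (n in seq(min(length(ngramTabs), length(words)), 1)) {  # longest context first
        context <- paste(tail(words, n), collapse = " ")
        tab  <- ngramTabs[[n]]
        hits <- tab[grepl(paste0("^", context, " "), tab$Term), ]
        if (nrow(hits) > 0) {
            ## recommend the extra word of the most frequent matching phrase
            best <- as.character(hits$Term[which.max(hits$Freq)])
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    NA_character_   # backoff exhausted: no recommendation
}
## e.g. predictNext("a case of", list(biGramTab, triGramTab, fourGramTab))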

Using these 3 steps, we achieve an 80% positive rate on the test quiz.

Further potential improvement