Introduction

The aim of this project is to develop an application that would predict a word on the basis of its history in a word sequence, i.e. on the basis of the words that appeared before it in the text. This can be achieved by building an n-gram model. n-grams, also called shingles, are word sequences of the length n that can be identified in the text. If we could find out that after a sequence [x y x] the word k appears much more frequent than other words, we could predict the k will likely follow [x y x].

In this report I clean and analyse available data sets, discover word and sequences common in English language and estimate how many shingles of each class would likely cover a certain portion of the a text sample.

Data

The data was originally obtained from http://www.corpora.heliohost.org/ and is downloadable at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The corpora analysed here were US English language samples and were collected from blogs, news and Twitter, respectively:

bl=readLines("en_US/en_US.blogs.txt")
ns=readLines("en_US/en_US.news.txt")
tw=readLines("en_US/en_US.twitter.txt")
## Warning in readLines("en_US/en_US.twitter.txt"): line 167155 appears to
## contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt"): line 268547 appears to
## contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt"): line 1759032 appears to
## contain an embedded nul

The number of lines, total words and words per line were calculated:

ST1=sapply(list("bl"=bl,"ns"=ns,"tw"=tw), function(x){
    w=sum(sapply(gregexpr("\\S+", x), length))
    c("lines"=length(x),"words"=w, "words/line"=round(w/length(x), 0))
})
ST1
##                  bl       ns       tw
## lines        899288  1010242  2360148
## words      37334131 34372530 30373543
## words/line       42       34       13

For convenience of the exploratory analysis, only samples of each corpora were used. Sampling was performed by line, and amount of line was taken such that each sample contained roughly one million words. Additionally a mix sample was taken from all 3 corpora:

bl=sample(bl,1000000/ST1[3,1])
ns=sample(ns,1000000/ST1[3,2])
tw=sample(tw,1000000/ST1[3,3])
mix=c(bl,ns,tw)
mix=sample(mix,length(mix)/3)

Transforming and cleaning the data

Lines were further split to sentences using sequences of dots, exclamations and question marks as separators. These sequences were preserved and separated with a space, so each such sequence was treated as a token. This way n-grams which are to be generated won’t span through sentences, but rather the whole n-gram will be a part of one and the same sentence.

sentence=function(x, char_end="([.!?]+)"){
    unlist(strsplit(gsub(paste(char_end, sep="")," \\1\n", x), " *\n *"))
}

Other special characters were used for tokenisation as word separators and were replaced by the space. This allowed separating words in phrases like “cat/dog” or misspelled phrases where the space has been omitted, like “cat,dog”.

Dollar and percent characters were treated as tokens, i.e. were preserved and surrounded by spaces.

Words were separated using the space character (or it’s sequences) as the separator.

token=function(x, char_split="([\U00000080-\U0010FFFF()/&_,:;\\^*+=\"{}@#~|]|\\[|\\]|\\-)", char_token="([$%])"){
    x=gsub(char_split, " ", x)
    x=gsub(char_token, " \\1 ", x)
    x=unlist(strsplit(x," +"))
    if (all (x=="")) x=c() else x=x[x!=""]
    x
}

Finally whole sentences containing profanities were removed. The list of profanities was obtained from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt:

profs=readLines("profanities.txt")

The above steps were implemented as one function and executed on all 4 samples.

tokenise=function(x){
    x=sentence(x)
    x=lapply(x,token)
    y=sapply(x, function(z) any(z %in% profs))
    x=x[!y]
}

mix=tokenise(mix)
bl=tokenise(bl)
ns=tokenise(ns)
tw=tokenise(tw)

Here is an example of a tokenised sentence:

bl[[1]]
## [1] "Another"  "decision" "!"

The numbers of sentences and tokens in the tokenised samples were calculated:

ST2=sapply(list("bl"=bl,"ns"=ns,"tw"=tw, "mix"=mix), function(x){
    t=sum(sapply(x,length))
    c("sentences"=length(x),"tokens"=t,"tokens/sentences"=round(t/length(x), 0))
})
ST2
##                      bl     ns      tw    mix
## sentences         63122  65041  129659  85628
## tokens           936124 963707 1002682 963408
## tokens/sentences     15     15       8     11

For each sentence of the mix sample quadru-grams (4-grams) were created and stored in a vector:

shingle=function(x, n){
    l=length(x)
    if (l>=n) sapply(1:(l-n+1), function(i) paste(x[i:(i+n-1)], collapse=" ")) else c()
}

TokensToShingles = function (text, n){
    out=unlist(sapply(text, function(x) shingle(x,n)))
}

mix4=TokensToShingles(mix,4)

Frequencies of each quadru-gram species were calculated, i.e. how many times each species appears in the set of all identified quadru-grams. The resulting frequency vector was sorted in the descending order:

mix4=table(mix4)
mix4=mix4[order(mix4, decreasing = T)]

Analogously, frequencies of tri-, di- and uni-grams were calculated:

TokensToShingleTable=function(x, n){
    x=TokensToShingles(x, n)
    x=table(x)
    x=x[order(x, decreasing = T)]
}

mix3=TokensToShingleTable(mix,3)
mix2=TokensToShingleTable(mix,2)
mix1=TokensToShingleTable(mix,1)

Exploratory analysis

Most frequent shingles

Most common shingles of each class were obtained. The top 10 common uni-grams or words appeared to be:

mix1[1:10]
## x
##     .   the    to     a   and    of     I    in   for     ! 
## 50555 35377 23562 19209 18959 16425 14364 12915  9483  9315

Most common bi-grams:

mix2[1:10]
## x
##  of the  in the  to the for the  on the   to be  at the and the    in a 
##    3430    3124    1832    1655    1633    1382    1056     920     884 
##    is a 
##     855

tri-grams:

mix3[1:10]
## x
##       a lot of     one of the      he said .    going to be Thanks for the 
##            251            242            175            164            152 
##        to be a     out of the      I want to     the end of    some of the 
##            152            136            129            119            115

and quadru-grams:

mix4[1:10]
## mix4
##        the end of the       the rest of the    for the first time 
##                    58                    56                    42 
##      in the middle of        is going to be         at the end of 
##                    42                    41                    36 
##      at the same time Thanks for the follow         is one of the 
##                    34                    33                    32 
##      when it comes to 
##                    31

Sample coverage by shigles

Frequencies were divided by the total number of shingles of a certain class (uni, bi, tri or quadru) to obtain fractions. Then cumulative fractions were calculated.

cum_fr=function(x){
    cumsum(x/sum(x))
}

plot_cover=function(x, ...){
    plot(seq(1,length(x),1000), x[seq(1,length(x),1000)], xlab="shingle number", ylab="coverage", ...)
}

mix1c=cum_fr(mix1)
mix2c=cum_fr(mix2)
mix3c=cum_fr(mix3)
mix4c=cum_fr(mix4)

For example:

mix3c[100]
## I decided to 
##  0.009010835

means that the first 100 most frequent tri-grams cover 0.9% of the whole text sample.

Cumulative fraction were plotted against shingle number to show how many of the first most frequent shingles cover a certain amount of the text sample:

par(mfcol=c(2,2))
plot_cover(mix1c, main="uni-grams")
plot_cover(mix2c, main="bi-grams")
plot_cover(mix3c, main="tri-grams")
plot_cover(mix4c, main="quadru-grams")

par(mfcol=c(1,1))

Coverage of 50% and 90% was calculated for each shingle class.

sapply(list("uni-grams"=mix1c, "bi-grams"=mix2c, "tri-grams"=mix3c, "quadru-grams"=mix4c),
       function(x) c("50%"=sum(x<0.5),"90%"=sum(x<0.9)))
##     uni-grams bi-grams tri-grams quadru-grams
## 50%       156    37374    279301       340981
## 90%      9513   335639    599049       630873

About 150 words cover 50% of the text while about 9 thousand is enough to cover 90%. On the other hand, 340 thousands quadru-gram species is needed to cover just 50%.

Similarity between different sources

To compare different data sources in the context of shared phrases, tri-grams were calculated for samples from blogs, news and twitter:

bl3=TokensToShingleTable(bl,3)
ns3=TokensToShingleTable(ns,3)
tw3=TokensToShingleTable(tw,3)

Cumulative fractions were used to identify most frequent tri-grams able to cover 50% of each sample:

bl3c=cum_fr(bl3)
ns3c=cum_fr(ns3)
tw3c=cum_fr(tw3)

bl3_=names(bl3c[bl3c<0.5])
ns3_=names(ns3c[ns3c<0.5])
tw3_=names(tw3c[tw3c<0.5])

Venn diagrams were constructed to show how many of these most frequent tri-grams are shared between the three data sets:

library(limma)

pool=unique(c(bl3_,ns3_,tw3_))
counts=sapply(list("blogs"=bl3_,"news"=ns3_,"twitter"=tw3_), function(x) pool %in% x)

vennDiagram(vennCounts(counts))

Only a very small fraction is common to all 3 data sets. The intersection between pairs are also small. The majority of the the 3 word phrases is specific for certain sources.

A similar pattern is observed if all tri-grams are taken into account:

pool2=unique(c(names(bl3),names(ns3),names(tw3)))
counts2=sapply(list("blogs"=names(bl3),"news"=names(ns3),"twitter"=names(tw3)), function(x) pool2 %in% x)

vennDiagram(vennCounts(counts2))

This would indicate that the source of the text matters and it is important to use sources of different language style to construct a training set for model development.