The aim of this project is to develop an application that would predict a word on the basis of its history in a word sequence, i.e. on the basis of the words that appeared before it in the text. This can be achieved by building an n-gram model. n-grams, also called shingles, are word sequences of the length n that can be identified in the text. If we could find out that after a sequence [x y x] the word k appears much more frequent than other words, we could predict the k will likely follow [x y x].
In this report I clean and analyse available data sets, discover word and sequences common in English language and estimate how many shingles of each class would likely cover a certain portion of the a text sample.
The data was originally obtained from http://www.corpora.heliohost.org/ and is downloadable at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The corpora analysed here were US English language samples and were collected from blogs, news and Twitter, respectively:
bl=readLines("en_US/en_US.blogs.txt")
ns=readLines("en_US/en_US.news.txt")
tw=readLines("en_US/en_US.twitter.txt")
## Warning in readLines("en_US/en_US.twitter.txt"): line 167155 appears to
## contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt"): line 268547 appears to
## contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt"): line 1759032 appears to
## contain an embedded nul
The number of lines, total words and words per line were calculated:
ST1=sapply(list("bl"=bl,"ns"=ns,"tw"=tw), function(x){
w=sum(sapply(gregexpr("\\S+", x), length))
c("lines"=length(x),"words"=w, "words/line"=round(w/length(x), 0))
})
ST1
## bl ns tw
## lines 899288 1010242 2360148
## words 37334131 34372530 30373543
## words/line 42 34 13
For convenience of the exploratory analysis, only samples of each corpora were used. Sampling was performed by line, and amount of line was taken such that each sample contained roughly one million words. Additionally a mix sample was taken from all 3 corpora:
bl=sample(bl,1000000/ST1[3,1])
ns=sample(ns,1000000/ST1[3,2])
tw=sample(tw,1000000/ST1[3,3])
mix=c(bl,ns,tw)
mix=sample(mix,length(mix)/3)
Lines were further split to sentences using sequences of dots, exclamations and question marks as separators. These sequences were preserved and separated with a space, so each such sequence was treated as a token. This way n-grams which are to be generated won’t span through sentences, but rather the whole n-gram will be a part of one and the same sentence.
sentence=function(x, char_end="([.!?]+)"){
unlist(strsplit(gsub(paste(char_end, sep="")," \\1\n", x), " *\n *"))
}
Other special characters were used for tokenisation as word separators and were replaced by the space. This allowed separating words in phrases like “cat/dog” or misspelled phrases where the space has been omitted, like “cat,dog”.
Dollar and percent characters were treated as tokens, i.e. were preserved and surrounded by spaces.
Words were separated using the space character (or it’s sequences) as the separator.
token=function(x, char_split="([\U00000080-\U0010FFFF()/&_,:;\\^*+=\"{}@#~|]|\\[|\\]|\\-)", char_token="([$%])"){
x=gsub(char_split, " ", x)
x=gsub(char_token, " \\1 ", x)
x=unlist(strsplit(x," +"))
if (all (x=="")) x=c() else x=x[x!=""]
x
}
Finally whole sentences containing profanities were removed. The list of profanities was obtained from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt:
profs=readLines("profanities.txt")
The above steps were implemented as one function and executed on all 4 samples.
tokenise=function(x){
x=sentence(x)
x=lapply(x,token)
y=sapply(x, function(z) any(z %in% profs))
x=x[!y]
}
mix=tokenise(mix)
bl=tokenise(bl)
ns=tokenise(ns)
tw=tokenise(tw)
Here is an example of a tokenised sentence:
bl[[1]]
## [1] "Another" "decision" "!"
The numbers of sentences and tokens in the tokenised samples were calculated:
ST2=sapply(list("bl"=bl,"ns"=ns,"tw"=tw, "mix"=mix), function(x){
t=sum(sapply(x,length))
c("sentences"=length(x),"tokens"=t,"tokens/sentences"=round(t/length(x), 0))
})
ST2
## bl ns tw mix
## sentences 63122 65041 129659 85628
## tokens 936124 963707 1002682 963408
## tokens/sentences 15 15 8 11
For each sentence of the mix sample quadru-grams (4-grams) were created and stored in a vector:
shingle=function(x, n){
l=length(x)
if (l>=n) sapply(1:(l-n+1), function(i) paste(x[i:(i+n-1)], collapse=" ")) else c()
}
TokensToShingles = function (text, n){
out=unlist(sapply(text, function(x) shingle(x,n)))
}
mix4=TokensToShingles(mix,4)
Frequencies of each quadru-gram species were calculated, i.e. how many times each species appears in the set of all identified quadru-grams. The resulting frequency vector was sorted in the descending order:
mix4=table(mix4)
mix4=mix4[order(mix4, decreasing = T)]
Analogously, frequencies of tri-, di- and uni-grams were calculated:
TokensToShingleTable=function(x, n){
x=TokensToShingles(x, n)
x=table(x)
x=x[order(x, decreasing = T)]
}
mix3=TokensToShingleTable(mix,3)
mix2=TokensToShingleTable(mix,2)
mix1=TokensToShingleTable(mix,1)
Most common shingles of each class were obtained. The top 10 common uni-grams or words appeared to be:
mix1[1:10]
## x
## . the to a and of I in for !
## 50555 35377 23562 19209 18959 16425 14364 12915 9483 9315
Most common bi-grams:
mix2[1:10]
## x
## of the in the to the for the on the to be at the and the in a
## 3430 3124 1832 1655 1633 1382 1056 920 884
## is a
## 855
tri-grams:
mix3[1:10]
## x
## a lot of one of the he said . going to be Thanks for the
## 251 242 175 164 152
## to be a out of the I want to the end of some of the
## 152 136 129 119 115
and quadru-grams:
mix4[1:10]
## mix4
## the end of the the rest of the for the first time
## 58 56 42
## in the middle of is going to be at the end of
## 42 41 36
## at the same time Thanks for the follow is one of the
## 34 33 32
## when it comes to
## 31
Frequencies were divided by the total number of shingles of a certain class (uni, bi, tri or quadru) to obtain fractions. Then cumulative fractions were calculated.
cum_fr=function(x){
cumsum(x/sum(x))
}
plot_cover=function(x, ...){
plot(seq(1,length(x),1000), x[seq(1,length(x),1000)], xlab="shingle number", ylab="coverage", ...)
}
mix1c=cum_fr(mix1)
mix2c=cum_fr(mix2)
mix3c=cum_fr(mix3)
mix4c=cum_fr(mix4)
For example:
mix3c[100]
## I decided to
## 0.009010835
means that the first 100 most frequent tri-grams cover 0.9% of the whole text sample.
Cumulative fraction were plotted against shingle number to show how many of the first most frequent shingles cover a certain amount of the text sample:
par(mfcol=c(2,2))
plot_cover(mix1c, main="uni-grams")
plot_cover(mix2c, main="bi-grams")
plot_cover(mix3c, main="tri-grams")
plot_cover(mix4c, main="quadru-grams")
par(mfcol=c(1,1))
Coverage of 50% and 90% was calculated for each shingle class.
sapply(list("uni-grams"=mix1c, "bi-grams"=mix2c, "tri-grams"=mix3c, "quadru-grams"=mix4c),
function(x) c("50%"=sum(x<0.5),"90%"=sum(x<0.9)))
## uni-grams bi-grams tri-grams quadru-grams
## 50% 156 37374 279301 340981
## 90% 9513 335639 599049 630873
About 150 words cover 50% of the text while about 9 thousand is enough to cover 90%. On the other hand, 340 thousands quadru-gram species is needed to cover just 50%.
To compare different data sources in the context of shared phrases, tri-grams were calculated for samples from blogs, news and twitter:
bl3=TokensToShingleTable(bl,3)
ns3=TokensToShingleTable(ns,3)
tw3=TokensToShingleTable(tw,3)
Cumulative fractions were used to identify most frequent tri-grams able to cover 50% of each sample:
bl3c=cum_fr(bl3)
ns3c=cum_fr(ns3)
tw3c=cum_fr(tw3)
bl3_=names(bl3c[bl3c<0.5])
ns3_=names(ns3c[ns3c<0.5])
tw3_=names(tw3c[tw3c<0.5])
Venn diagrams were constructed to show how many of these most frequent tri-grams are shared between the three data sets:
library(limma)
pool=unique(c(bl3_,ns3_,tw3_))
counts=sapply(list("blogs"=bl3_,"news"=ns3_,"twitter"=tw3_), function(x) pool %in% x)
vennDiagram(vennCounts(counts))
Only a very small fraction is common to all 3 data sets. The intersection between pairs are also small. The majority of the the 3 word phrases is specific for certain sources.
A similar pattern is observed if all tri-grams are taken into account:
pool2=unique(c(names(bl3),names(ns3),names(tw3)))
counts2=sapply(list("blogs"=names(bl3),"news"=names(ns3),"twitter"=names(tw3)), function(x) pool2 %in% x)
vennDiagram(vennCounts(counts2))
This would indicate that the source of the text matters and it is important to use sources of different language style to construct a training set for model development.