Summary

This report presents the first steps toward building a word prediction application for the English language, based on a given data set (SwiftKey). It contains an overview of the n-gram distributions and the characteristics of the underlying data.

Data preprocessing

The data set consists of three files containing English text from US news articles, Twitter messages, and blog entries. Each file is approximately 200 MB in size. The files contain the following numbers of lines and words:

file                 number of lines   number of words
en_US.blogs.txt      899288            38050950
en_US.news.txt       77259             35628125
en_US.twitter.txt    2360148           31062690

The Twitter data contains the most entries (lines), followed by the blog and the news data.
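
For illustration, such counts can also be obtained directly in R; the sketch below assumes the file path and uses a simple whitespace split for word counting, so the numbers may differ slightly from the Java tokenizer.

# Hedged sketch: count lines and words of one input file in R (assumed path).
f <- "../final/en_US/en_US.twitter.txt"
lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
c(lines = length(lines),
  words = sum(lengths(strsplit(lines, "\\s+"))))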

The files are read and tokenized in Java for performance reasons. Because the future application has to capture sense and grammar, the tokenization was kept deliberately conservative: the tokens were left unstemmed and apostrophes were preserved (see the Outlook section). The same transformations were applied to all the input data.

With much weaker assumptions (stemming, numbers removed, …), the preprocessing could also be done in R, for example with the tm package (Feinerer and Hornik 2015) or quanteda (Benoit et al. 2016).
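
Below is a minimal sketch of such a weaker-assumption preprocessing with quanteda; the file path is an assumption, and the exact argument names depend on the installed quanteda version.

library(quanteda)

# Minimal sketch (assumed path): tokenize, remove punctuation and numbers,
# lowercase and stem the tokens, then show the most frequent unigrams.
txt <- readLines("../final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_wordstem(tokens_tolower(toks))
topfeatures(dfm(toks), 10)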

Each file was scanned, and tokens were extracted with the following regular expressions for uni-, bi-, and trigrams:

(?=(\\b\\w[\\w']*))
(?=(\\b\\w[\\w']* \\w[\\w']*))
(?=(\\b\\w[\\w']* \\w[\\w']* \\w[\\w']*))
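
The look-ahead construct (?=(…)) produces a zero-width match at every word start with the actual n-gram in capture group 1, so overlapping n-grams are also found. The production tokenizer runs in Java, but the same pattern can be checked in R with stringr; the sample sentence below is made up for illustration.

library(stringr)

# Check of the bigram pattern on a made-up sentence: each overlapping bigram
# ends up in capture group 1 (column 2 of the match matrix).
s <- "looking forward to the weekend"
m <- str_match_all(s, "(?=(\\b\\w[\\w']* \\w[\\w']*))")[[1]]
m[, 2]
# expected: "looking forward" "forward to" "to the" "the weekend"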

The result of the preprocessing step was a set of three files (corresponding to uni-, bi-, and trigrams) containing the tokens and their numbers of occurrences, sorted in decreasing order of occurrence. Only the tokens that make up 90% of the total occurrences and have an occurrence count > 1 are kept. The files look like:

of the, 428030
in the, 405568
to the, 212043
for the, 199879
on the, 195077
to be, 161846

In the case of trigrams, only 19% of the trigrams occur more than once in the whole data set. The table below shows the number of tokens taken into account and the coverage (the share of n-gram occurrences in the text corpus that is covered by the tokens used).

n-gram   total        used       used [%]   coverage [%]
1        634431       7424       1          90
2        10645599     2708735    25         90
3        31637680     6124071    19         65
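
As a sketch of how the cutoff works, the figures in the table can be derived from a cumulative sum over the occurrence counts; freq stands for a hypothetical, unfiltered frequency table (columns token and occurrence, sorted by decreasing occurrence), and a tiny made-up example is used here.

# Hypothetical sketch: apply the "occurrence > 1" and "90% coverage" cutoffs.
freq <- data.frame(token = c("of the", "in the", "to the", "rare one", "rare two"),
                   occurrence = c(50, 30, 15, 1, 1))
coverage <- cumsum(freq$occurrence) / sum(freq$occurrence)
keep <- freq$occurrence > 1 & coverage <= 0.90
used <- freq[keep, ]
c(total = nrow(freq),
  used  = nrow(used),
  used_percent     = round(100 * nrow(used) / nrow(freq)),
  coverage_percent = round(100 * sum(used$occurrence) / sum(freq$occurrence)))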

Data analysis

After the preprocessing, the resulting n-gram tables were imported into R:

library(data.table)

# Read the n-gram frequency files (token, occurrence) produced by the Java
# preprocessing step and add an index column holding the rank of each token.
unigrams <- fread("../final/out/en_US/1g.txt", header = FALSE, sep = ",")
colnames(unigrams) <- c("token", "occurrence")
unigrams$index <- seq_len(nrow(unigrams))

bigrams <- fread("../final/out/en_US/2g.txt", header = FALSE, sep = ",")
colnames(bigrams) <- c("token", "occurrence")
bigrams$index <- seq_len(nrow(bigrams))

trigrams <- fread("../final/out/en_US/3g.txt", header = FALSE, sep = ",")
colnames(trigrams) <- c("token", "occurrence")
trigrams$index <- seq_len(nrow(trigrams))

The occurrence distribution of the n-grams can be described, at least for the rarely occurring tokens, by the power law \[o = a \cdot i^{k}\] with \(o\) the number of occurrences, \(i\) the index (rank) of the token, and \(a\) and \(k\) being the distribution parameters. Taking logarithms gives \(\log o = \log a + k \log i\), so both parameters can be estimated with a linear fit on the double-logarithmic scale. For the n-grams used here, the fitted parameters and the distributions are:

# Linear fits on the double-logarithmic scale: the intercept estimates log(a),
# the slope estimates k.
u<-glm(log(occurrence) ~ log(index), family="gaussian", data=unigrams)
b<-glm(log(occurrence) ~ log(index), family="gaussian", data=bigrams)
t<-glm(log(occurrence) ~ log(index), family="gaussian", data=trigrams)

fit<-cbind(coef(u), coef(b), coef(t))
print(fit)
##                 [,1]      [,2]       [,3]
## (Intercept) 16.96138 17.614538 13.4069545
## log(index)  -1.11256 -1.143929 -0.8292701
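
The power-law parameters follow directly from the fitted coefficients, e.g. for the unigrams:

# Parameters implied by the unigram fit: a = exp(intercept), k = slope.
a_uni <- exp(fit[1, 1])
k_uni <- fit[2, 1]
c(a = a_uni, k = k_uni)
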
library(ggplot2)
qplot(log(unigrams$index), log(unigrams$occurrence))+geom_abline(slope = fit[2, 1], intercept = fit[1, 1])+labs(x="Log(n-gram index)", y="Log(Number of occurrences)", title="Double logarithmic plot of the unigram occurrence distribution")

# Plot the 10,000 most frequent bigrams plus a random sample of 1,000 bigrams
# from the full table to keep the number of plotted points manageable.
index<-c(seq_len(10000), sample(seq_len(nrow(bigrams)), 1000))
qplot(log(bigrams[index]$index), log(bigrams[index]$occurrence))+geom_abline(slope = fit[2, 2], intercept = fit[1, 2])+labs(x="Log(n-gram index)", y="Log(Number of occurrences)", title="Double logarithmic plot of the bigram occurrence distribution")

# Same sampling scheme for the trigrams.
index<-c(seq_len(10000), sample(seq_len(nrow(trigrams)), 1000))
qplot(log(trigrams[index]$index), log(trigrams[index]$occurrence))+geom_abline(slope = fit[2, 3], intercept = fit[1, 3])+labs(x="Log(n-gram index)", y="Log(Number of occurrences)", title="Double logarithmic plot of the trigram occurrence distribution")

The distribution of the n-grams within the corresponding class is shown in the graphs below. Not surprisingly, many of the common English “stopwords” are at the top of the unigram distribution. The trigram distribution, on the other hand, starts to capture sense and grammar, as in “looking forward to” or “can’t wait to”.

hu<-head(unigrams, n=50)
qplot(hu$token, hu$occurrence/sum(unigrams$occurrence)) + geom_bar(position = "dodge", stat="identity")+coord_flip() + scale_x_discrete(limits=hu$token[order(hu$occurrence, decreasing=F)])+labs(x="Unigram token", y="Relative occurrence", title="Unigram distribution")

hb<-head(bigrams, n=50)
qplot(hb$token, hb$occurrence/sum(bigrams$occurrence)) + geom_bar(position = "dodge", stat="identity")+coord_flip() + scale_x_discrete(limits=hb$token[order(hb$occurrence, decreasing=F)])+labs(x="Bigram token", y="Relative occurrence", title="Bigram distribution")

ht<-head(trigrams, n=50)
qplot(ht$token, ht$occurrence/sum(trigrams$occurrence)) + geom_bar(position = "dodge", stat="identity")+coord_flip() + scale_x_discrete(limits=ht$token[order(ht$occurrence, decreasing=F)])+labs(x="Trigram token", y="Relative occurrence", title="Trigram distribution")

Outlook

The n-grams were created unstemmed and without removing apostrophes. The bi- and trigrams can therefore be used to show the user meaningful proposals for the next word, which should also be grammatically correct.

The low coverage in the case of trigrams (only 19% of the found trigrams occur more than once) could be a problem for the word proposals. It has to be analysed in more detail whether the low coverage is mainly due to misspellings (which would actually improve the quality of the word proposals, because wrong variants are filtered out) or due to genuine variations that occur rarely; one possible check is sketched below. The text corpus used may also contain domains that are too specific to reach a better coverage. On the other hand, 65% coverage of the text corpus is reached with the trigrams used.
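
One possible check is to test the words of rarely occurring trigrams against a dictionary, for example with the hunspell package; the sketch below assumes a full, unfiltered trigram table (here called trigrams_all) and uses a tiny made-up example.

library(hunspell)

# Hypothetical sketch: share of word types in trigrams occurring only once
# that are not found in the dictionary (i.e. likely misspellings).
trigrams_all <- data.frame(token = c("cant wiat to", "looking forwrad to", "one of the"),
                           occurrence = c(1, 1, 120))
rare  <- trigrams_all[trigrams_all$occurrence == 1, ]
words <- unique(unlist(strsplit(rare$token, " ", fixed = TRUE)))
mean(!hunspell_check(words))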

References

Feinerer, Ingo, and Kurt Hornik. 2015. “Tm: Text Mining Package.” http://CRAN.R-project.org/package=tm.

Benoit, Kenneth, Paul Nulty, Kohei Watanabe, Benjamin Lauderdale, Adam Obeng, and Pablo Barberá. 2016. “Quanteda: Quantitative Analysis of Textual Data.” http://CRAN.R-project.org/package=quanteda.