This report summarizes the data preparation and exploratory analysis performed on the SwiftKey corpora of blogs, news and Twitter feeds provided by Coursera, with the final objective of building a predictive text model that predicts the next word in a sequence. R packages written for Natural Language Processing, namely tm and RWeka, are used to create the corpus and then to build Ngrams (Uni-, Bi-, Tri- and Quadragrams). The Ngram results are summarized into data frames that are used to create summary tables, barplots and word clouds. The modeling steps envisaged are stated at the end.
The files are downloaded and unzipped. For this project, I am limiting the analyses to the English-language files in the en_US directory, namely the 3 files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The files are read in using the appropriate Unicode encoding and conversions. A 5% sample of each file is written into the /cleanedSamples directory. Since the memory needed for Ngram tokenization is very high, I had to settle for a small sample size.
#Read in US.blogs and write out a 5% sample
USblogsRaw <- file("./en_US/en_US.blogs.txt", open="rb")
USblogs <- readLines(USblogsRaw, skipNul=TRUE, encoding="UTF-8")
close(USblogsRaw)
#Drop non-ASCII characters that would trip up the tokenizers
USblogs <- iconv(USblogs, "latin1", "ASCII", sub="")
rm(USblogsRaw)
#Fix the seed so the sample is reproducible
set.seed(123)
sampBlogs <- sample(USblogs, round(0.05*length(USblogs)))
if (!dir.exists("./cleanedSamples")) dir.create("./cleanedSamples")
write(sampBlogs, file="./cleanedSamples/en_US.blogs.txt")
The same kind of processing is done on the news and Twitter files, and the samples are written out (code chunk hidden using the echo=FALSE option); a sketch of the news chunk is shown below.
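For reference, a minimal sketch of the hidden chunk for the news file; it is assumed to mirror the blogs chunk above, and the Twitter file is handled the same way.
#Read in US.news and write out a 5% sample (sketch mirroring the blogs chunk)
USnewsRaw <- file("./en_US/en_US.news.txt", open="rb")
USnews <- readLines(USnewsRaw, skipNul=TRUE, encoding="UTF-8")
close(USnewsRaw)
USnews <- iconv(USnews, "latin1", "ASCII", sub="")
rm(USnewsRaw)
sampNews <- sample(USnews, round(0.05*length(USnews)))
write(sampNews, file="./cleanedSamples/en_US.news.txt")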
The word count, the mean word count and the number of lines for each of the 3 files are output below:
library(stringi)
#Count words and lines
WordCountBlogs <- stri_count_words(USblogs)
WordCountNews <- stri_count_words(USnews)
WordCountTweets <- stri_count_words(UStweets)
#output a summary table
data.frame(filename = c("USblogs", "USnews", "UStweets"),
           LineCount = c(length(USblogs), length(USnews), length(UStweets)),
           WordCount = c(sum(WordCountBlogs), sum(WordCountNews), sum(WordCountTweets)),
           MeanWordCount = c(round(mean(WordCountBlogs), 2), round(mean(WordCountNews), 2), round(mean(WordCountTweets), 2)))
## filename LineCount WordCount MeanWordCount
## 1 USblogs 899288 37510168 41.71
## 2 USnews 1010242 34749301 34.40
## 3 UStweets 2360148 30088605 12.75
All the intermediate data sets are removed in order to clear memory (code hidden); a sketch follows.
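A minimal sketch of the hidden cleanup, assuming the object names used above (the samp* names for news and tweets are hypothetical):
#Free memory: drop the raw text vectors and word counts once the samples are written
rm(USblogs, USnews, UStweets, WordCountBlogs, WordCountNews, WordCountTweets, sampBlogs, sampNews, sampTweets)
gc()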
The VCorpus function in the tm package is used to read the three sample files into a single corpus. Subsequently, the tm_map function applies the necessary transformations, as in the code below.
library(tm)
UScorpus <- VCorpus(DirSource("./cleanedSamples", encoding = "UTF-8"), readerControl = list(language= "en"))
#Wrap tolower in content_transformer so the documents stay PlainTextDocuments
UScorpus <- tm_map(UScorpus, content_transformer(tolower))
UScorpus <- tm_map(UScorpus, stripWhitespace)
UScorpus <- tm_map(UScorpus, removePunctuation)
UScorpus <- tm_map(UScorpus, removeNumbers)
Since the objective is to predict the next word, removing stop words is not appropriate here: the stop words themselves carry the context that determines what follows (the word after "of" is rarely the word that follows "to"). For the same reason, stemming is not done, because it would strip contextual information from the words that is important for predicting the next one.
The NGramTokenizer function in the RWeka package is used to create the Ngrams. Separate Ngrams are built for the blogs, news and Twitter files and rolled up into data frames, which are then merged to create the unigram, bigram, trigram and quadragram data frames.
library(RWeka)
#Wrapper functions around NGramTokenizer, one per Ngram order
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadragramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
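For illustration, applying one of these wrappers to a short string shows the kind of tokens produced (the example string is my own, and the expected output is hedged accordingly):
#Quick check of the bigram tokenizer on a sample string
BigramTokenizer("the quick brown fox")
## Expected tokens: "the quick", "quick brown", "brown fox"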
To work around the large memory requirements of Ngram tokenization, each of the corpora (blogs, news and tweets) is processed separately, converted to a data frame and then merged into a single Ngram data frame. After each data frame is created, all the intermediate datasets are removed to free memory (code hidden).
The code for Unigram tokenization is shown below:
#Create unigrams for blogs, news and twitter separately, then merge the data frames into UnigramData
blogsUnigram<- DocumentTermMatrix(UScorpus[1], control = list(tokenize = UnigramTokenizer))
newsUnigram<- DocumentTermMatrix(UScorpus[2], control = list(tokenize = UnigramTokenizer))
tweetsUnigram<- DocumentTermMatrix(UScorpus[3], control = list(tokenize = UnigramTokenizer))
blogsUnimat <- sort(colSums(as.matrix(blogsUnigram)), decreasing=TRUE)
blogsUniData <- data.frame(word = names(blogsUnimat), freq = blogsUnimat)
newsUnimat <- sort(colSums(as.matrix(newsUnigram)), decreasing=TRUE)
newsUniData <- data.frame(word = names(newsUnimat), freq = newsUnimat)
tweetsUnimat <- sort(colSums(as.matrix(tweetsUnigram)), decreasing=TRUE)
tweetsUniData <- data.frame(word = names(tweetsUnimat), freq = tweetsUnimat)
#Full outer join (all = TRUE) keeps words that appear in only some of the corpora
UnigramData <- merge(merge(blogsUniData, newsUniData, by = "word", all = TRUE), tweetsUniData, by = "word", all = TRUE)
UnigramData$frequency <- rowSums(UnigramData[, c("freq.x", "freq.y", "freq")], na.rm = TRUE)
UnigramData <- subset(UnigramData, select = -c(freq.x, freq.y, freq))
UnigramData <- UnigramData[order(-UnigramData$frequency), ]
The code chunks for the bigram, trigram and quadragram tokenization are not displayed, to keep the report concise; a sketch of the bigram version follows.
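For reference, a sketch of the hidden bigram chunk, assumed to mirror the unigram code above with BigramTokenizer substituted (the trigram and quadragram chunks differ only in the tokenizer used):
#Create bigrams for blogs, news and twitter separately (sketch)
blogsBigram <- DocumentTermMatrix(UScorpus[1], control = list(tokenize = BigramTokenizer))
newsBigram <- DocumentTermMatrix(UScorpus[2], control = list(tokenize = BigramTokenizer))
tweetsBigram <- DocumentTermMatrix(UScorpus[3], control = list(tokenize = BigramTokenizer))
#The colSums, merge and sort steps are the same as for the unigrams, yielding BigramData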
#Unigram
head(UnigramData, n = 10)
## word frequency
## 21222 the 238146
## 749 and 119907
## 8278 for 54865
## 21214 that 52209
## 23591 you 47015
## 23339 with 35723
## 22936 was 31236
## 21298 this 27150
## 9637 have 26576
## 1015 are 24535
#Bigram
head(BigramData, n = 10)
## word frequency
## 46733 of the 21463
## 33491 in the 20712
## 71719 to the 10727
## 24809 for the 10035
## 47688 on the 9683
## 70277 to be 8073
## 9503 at the 7196
## 7251 and the 6286
## 32610 in a 5892
## 78509 with the 5259
#Trigram
head(TrigramData, n = 10)
## word frequency
## 27597 one of the 1729
## 767 a lot of 1506
## 32250 thanks for the 1188
## 38037 to be a 873
## 12773 going to be 858
## 33772 the end of 747
## 16607 i want to 731
## 20514 it was a 724
## 28147 out of the 716
## 5006 as well as 707
#Quadragram
head(QuadragramData, n = 10)
## word frequency
## 7338 the end of the 373
## 7690 the rest of the 368
## 1193 at the end of 315
## 2282 for the first time 312
## 1214 at the same time 253
## 6264 one of the most 231
## 4506 is going to be 221
## 4298 in the middle of 207
## 9811 when it comes to 205
## 4589 is one of the 204
library(ggplot2)
#Barplot for Unigram
ggplot(UnigramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
geom_bar(stat = "identity", fill = "brown") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
labs(title = "Top Unigrams by Frequency")
#Barplot for bigram
ggplot(BigramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
geom_bar(stat = "identity", fill = "brown") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
labs(title = "Top Bigrams by Frequency")
#Barplot for Trigram
ggplot(TrigramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
geom_bar(stat = "identity", fill = "brown") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
labs(title = "Top Trigrams by Frequency")
#Barplot for Quadragram
ggplot(QuadragramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
geom_bar(stat = "identity", fill = "brown") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Quadragram") + ylab("Frequency") +
labs(title = "Top Quadragrams by Frequency")
Word clouds with the top 50 words for each of the Ngrams are shown below:
library(wordcloud)
#Unigram Word Cloud
wordcloud(UnigramData[1:50,]$word, UnigramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))
#Bigram Word Cloud
wordcloud(BigramData[1:50,]$word, BigramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))
#Trigram Word Cloud
wordcloud(TrigramData[1:50,]$word, TrigramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))
#Quadragram Word Cloud
wordcloud(QuadragramData[1:50,]$word, QuadragramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))
The next step is to build a prediction model using a combination of bigrams, trigrams and quadragrams. An appropriate smoothing or backoff technique will be chosen from among Stupid Backoff, Good-Turing and Kneser-Ney, based on further study and analysis.
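To make the simplest of these candidates concrete, below is a minimal sketch of Stupid Backoff scoring over the Ngram data frames built above. The helper names ngramFreq and stupidBackoff are my own, the fixed 0.4 back-off penalty is the value suggested by Brants et al. (2007), and the word/frequency column names match the data frames in this report; this is an illustration, not the final model.
#Stupid Backoff: score a candidate next word given up to three words of context (sketch only)
ngramFreq <- function(df, phrase) {
  #look up the count of an Ngram string in one of the data frames above; 0 if unseen
  f <- df$frequency[df$word == phrase]
  if (length(f) == 0) 0 else f[1]
}
stupidBackoff <- function(context, candidate, alpha = 0.4) {
  #assumes at least one word of context
  words <- tail(strsplit(tolower(context), "\\s+")[[1]], 3)
  tables <- list(UnigramData, BigramData, TrigramData, QuadragramData)
  for (n in length(words):1) {
    history <- paste(tail(words, n), collapse = " ")
    num <- ngramFreq(tables[[n + 1]], paste(history, candidate))
    den <- ngramFreq(tables[[n]], history)
    #use the highest-order Ngram that was observed, discounted by alpha per back-off step
    if (num > 0 && den > 0) return(alpha^(length(words) - n) * num / den)
  }
  #back off all the way to the unigram relative frequency
  alpha^length(words) * ngramFreq(UnigramData, candidate) / sum(UnigramData$frequency)
}
#Example: stupidBackoff("one of", "the") should score high, given the trigram counts above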