Introduction

The goal of the capstone project is to build a Shiny application based on a predictive text model trained on corpus data. The user provides a word or short phrase and the application tries to predict the next word. Three English-language corpora collected from the Internet (news, blogs, and tweets) will be used to train the model.

The purpose of this report is to present the preprocessing and exploratory data analysis that have been applied to the data.

Libraries

The following libraries are used:

library(tibble)
library(gdata)
library(stringr)
library(knitr)
library(lattice)
library(gridExtra)
library(tm)
library(RWeka)
library(dplyr)
library(kableExtra)
library(wordcloud)

Acquiring and Reading the Data

We download and unzip the data set if it is not already present, then read the three English-language files into memory:

if (!file.exists("final")){
        zipfile="Coursera-SwiftKey.zip"
        if (!file.exists(zipfile)){
                dataurl="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
                download.file(dataurl,zipfile)
        }
        unzip(zipfile)
}

blogsfile="final/en_US/en_US.blogs.txt"
newsfile="final/en_US/en_US.news.txt"
twitterfile="final/en_US/en_US.twitter.txt"

blogs=readLines(blogsfile,skipNul=TRUE)
news=readLines(newsfile,skipNul=TRUE)
twits=readLines(twitterfile,skipNul=TRUE)
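The read step can be made slightly more defensive. The following is a minimal sketch, assuming the files are UTF-8 encoded; `read_corpus` is a hypothetical helper that is not part of the original analysis. Opening the connection in binary mode is a common precaution against embedded control characters cutting the read short on some platforms.

```r
# Hypothetical helper (an assumption, not from the original report):
# read a text file line by line and mark the strings as UTF-8.
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # binary mode: control characters do not end the read early
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage on a small temporary file that stands in for one of the corpus files
tmp <- tempfile(fileext = ".txt")
writeLines(c("first entry", "second entry"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
```

The same helper could be called with `blogsfile`, `newsfile` or `twitterfile` in place of the temporary file.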

General Statistical Summary

We compute some basic summary statistics to get to know the datasets a little better:

char=sapply(list(blogs,news,twits),nchar)
word=sapply(list(blogs,news,twits),function(x){str_count(x,pattern="\\w+")})
stats=tibble(
        corpus=c("blogs","news","twitter"),
        filesize=sapply(list(blogsfile,newsfile,twitterfile),function(x){humanReadable(file.info(x)$size)}),
        memorysize=sapply(list(blogs,news,twits),function(x){humanReadable(object.size(x))}),
        nlines=sapply(list(blogs,news,twits),length),
        totalchar=sapply(char,sum),
        meanchar=sapply(char,mean),
        medianchar=sapply(char,median),
        maxchar=sapply(char,max),
        totalword=sapply(word,sum),
        meanword=sapply(word,mean),
        medianword=sapply(word,median),
        maxword=sapply(word,max))
kable(stats,col.names=c("Corpus","File size","Size in memory","Number of entries",
        "Total number of characters","Mean number of characters","Median number of characters","Maximum number of characters",
        "Total number of words","Mean number of words","Median number of words","Maximum number of words"))
| Corpus | File size | Size in memory | Number of entries | Total number of characters | Mean number of characters | Median number of characters | Maximum number of characters | Total number of words | Mean number of words | Median number of words | Maximum number of words |
|---|---|---|---|---|---|---|---|---|---|---|---|
| blogs | 200.4 MiB | 255.4 MiB | 899288 | 206824505 | 229.98695 | 156 | 40833 | 38309620 | 42.59995 | 29 | 6851 |
| news | 196.3 MiB | 257.3 MiB | 1010242 | 203223159 | 201.16285 | 185 | 11384 | 35624454 | 35.26329 | 32 | 1928 |
| twitter | 159.4 MiB | 319.0 MiB | 2360148 | 162096241 | 68.68054 | 64 | 140 | 31003544 | 13.13627 | 12 | 47 |

Please note that total number of characters and total number of words are per corpus, whereas mean, median and maximum number of characters and words are per entry in each corpus.
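The distinction between per-corpus and per-entry figures can be illustrated on a toy example. This is only a sketch: the three entries are made-up stand-ins for a corpus, and the word count uses a base-R counterpart of `str_count(x, "\\w+")` so that the snippet runs without extra packages.

```r
# Toy corpus standing in for blogs/news/twits (illustrative data only)
entries <- c("Hello world", "A slightly longer entry here", "Hi")

# Characters per entry
chars_per_entry <- nchar(entries)

# Words per entry: count runs of word characters, as str_count(x, "\\w+") does
words_per_entry <- lengths(regmatches(entries, gregexpr("\\w+", entries)))

summary_row <- c(
  nlines     = length(entries),
  totalchar  = sum(chars_per_entry),     # one total for the whole corpus
  meanchar   = mean(chars_per_entry),    # average over entries
  totalword  = sum(words_per_entry),     # one total for the whole corpus
  medianword = median(words_per_entry)   # median over entries
)
summary_row
```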

We plot histograms of the number of characters and the number of words per entry for the blogs, news and Twitter datasets. Because these counts span several orders of magnitude in the blogs and news datasets, their histograms use a base-10 logarithmic x-axis. This is not necessary for the Twitter dataset, thanks to the character limit on Twitter posts.

histscal=list(x=list(at=c(0,1,2,3,4),label=c(1,10,100,1000,10000)))

plotcb=histogram(log10(char[[1]]),scales=histscal,main="Blog",xlab="Number of characters")
plotcn=histogram(log10(char[[2]]),scales=histscal,main="News",xlab="Number of characters")
plotct=histogram(char[[3]],main="Twitter",xlab="Number of characters")

plotwb=histogram(log10(word[[1]]),scales=histscal,main="Blog",xlab="Number of words")
plotwn=histogram(log10(word[[2]]),scales=histscal,main="News",xlab="Number of words")
plotwt=histogram(word[[3]],main="Twitter",xlab="Number of words")

grid.arrange(plotcb,plotwb,ncol=2)

grid.arrange(plotcn,plotwn,ncol=2)

grid.arrange(plotct,plotwt,ncol=2)
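One detail of the logarithmic axis is worth keeping in mind: `log10(0)` is `-Inf`, so an entry with no words cannot be placed on the axis. The report does not state whether such entries occur, so the following is only a precautionary sketch on made-up counts. It also shows how the tick positions 0 to 4 map back to the labels 1 to 10000.

```r
# Illustrative word counts; the 0 represents a hypothetical empty entry
counts <- c(0, 1, 12, 150, 40833)

# An empty entry has no finite position on a log axis
zero_position <- log10(counts[1])

# Keep only entries with at least one word before taking logarithms
positive <- counts[counts > 0]
logged   <- log10(positive)

# Tick positions on the log axis and the labels they stand for
tick_at     <- 0:4
tick_labels <- 10^tick_at
data.frame(at = tick_at, label = tick_labels)
```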

Corpus processing and analysis

We have seen above that there are 899288 blog entries, 1010242 news entries, and 2360148 Twitter entries. This is too much data to process comfortably with limited computer memory. We will therefore limit the analysis to a random sample of 25000 entries from each of the blogs and news datasets, and 50000 entries from the Twitter dataset.

set.seed(20201130)
blogs=sample(blogs,25000)
news=sample(news,25000)
twits=sample(twits,50000)
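Fixing the seed makes the random sample reproducible. The sketch below uses a made-up population of 1000 strings to show that the same seed gives the same sample, and that `sample` draws without replacement by default.

```r
# Stand-in for a full data set (illustrative only)
population <- paste("entry", 1:1000)

set.seed(20201130)
first_draw <- sample(population, 25)    # without replacement by default

set.seed(20201130)
second_draw <- sample(population, 25)

identical(first_draw, second_draw)      # TRUE: same seed, same sample
anyDuplicated(first_draw)               # 0: no entry is drawn twice
```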

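We will use the tm package to convert the three datasets into proper corpora, and we will clean and preprocess the data. We will use two functions for this process. With the first, we build the corpus, replace hyphens with spaces, remove punctuation and numbers, convert the text to lowercase, and collapse repeated whitespace.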

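makecorpus <- function(x){
        corpus=VCorpus(VectorSource(x))
        # replace hyphens with spaces so hyphenated words are split, not merged
        # content_transformer keeps each document a proper text document
        corpus=tm_map(corpus, content_transformer(function(y) gsub("-", " ", y)))
        corpus=tm_map(corpus, removePunctuation)
        corpus=tm_map(corpus, removeNumbers)
        corpus=tm_map(corpus, content_transformer(tolower))
        corpus=tm_map(corpus, stripWhitespace)
        corpus
}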
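To make the effect of each cleaning step visible, the same sequence can be approximated on a single string with base R. This is a sketch only: `clean_text` is a hypothetical helper, and its regular expressions approximate the tm transformations rather than reproduce them exactly.

```r
# Hypothetical base-R approximation of the cleaning steps (not the tm implementation)
clean_text <- function(x) {
  x <- gsub("-", " ", x, fixed = TRUE)   # hyphens become spaces
  x <- gsub("[[:punct:]]+", "", x)       # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)       # drop numbers
  x <- tolower(x)                        # lowercase
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  x
}

cleaned <- clean_text("State-of-the-art  models, 2020!")
trimws(cleaned)
```

Replacing hyphens before removing punctuation matters: otherwise "State-of-the-art" would collapse into the single token "Stateoftheart".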

We will use the second function to remove English stop words and reduce the remaining words to their stems.

processcorpus <- function(corpus){
        # drop English stop words, then reduce the remaining words to their stems
        corpus=tm_map(corpus,removeWords,stopwords("english"))
        corpus=tm_map(corpus,stemDocument)
        corpus
}
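Stop-word removal can be illustrated on a single sentence with base R. The sketch below uses a tiny made-up stop list rather than the full `stopwords("english")` list, and `remove_stopwords` is a hypothetical helper. Stemming is not shown, since it needs a stemmer such as the one tm relies on.

```r
# Hypothetical helper: drop every token that appears in the stop list
remove_stopwords <- function(x, stop_list) {
  tokens <- strsplit(x, " +")[[1]]
  paste(tokens[!tokens %in% stop_list], collapse = " ")
}

# Tiny illustrative stop list (the real list is much longer)
toy_stop_list <- c("the", "on", "a", "of")

kept <- remove_stopwords("the cat sat on the mat", toy_stop_list)
kept
```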

We apply both functions in sequence to the three datasets:

corpus2blogs=processcorpus(makecorpus(blogs))
corpus2news=processcorpus(makecorpus(news))
corpus2twits=processcorpus(makecorpus(twits))

Word frequencies

One of the first approaches to studying a corpus quantitatively is to look at how often each word occurs. We will use the corpora after stop words have been removed and words have been reduced to their stems, which gives a more realistic picture of which lexical terms are the most common. Using the tm and RWeka packages, we define a tokenizing function, build a term-document matrix for each dataset, and sum its rows to obtain the frequency of each term. We keep only the 100 most frequent terms.

token1 <- function(x){
        NGramTokenizer(x, Weka_control(min = 1, max = 1))
}

corpusblogsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2blogs,PlainTextDocument), control = list(tokenize = token1))))
corpusblogsfreq1g <- data.frame(word=names(corpusblogsfreq1g), frequency=corpusblogsfreq1g,row.names = c())
corpusblogsfreq1g=arrange(corpusblogsfreq1g,desc(frequency))[1:100,]

corpusnewsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2news,PlainTextDocument), control = list(tokenize = token1))))
corpusnewsfreq1g <- data.frame(word=names(corpusnewsfreq1g), frequency=corpusnewsfreq1g,row.names = c())
corpusnewsfreq1g=arrange(corpusnewsfreq1g,desc(frequency))[1:100,]

corpustwitsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2twits,PlainTextDocument), control = list(tokenize = token1))))
corpustwitsfreq1g <- data.frame(word=names(corpustwitsfreq1g), frequency=corpustwitsfreq1g,row.names = c())
corpustwitsfreq1g=arrange(corpustwitsfreq1g,desc(frequency))[1:100,]
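The counting step itself is simple and can be sketched in base R on a toy example. The three entries are made up, and `table` takes the place of the term-document matrix, so this shows the idea rather than the tm pipeline used above.

```r
# Illustrative cleaned entries (made-up data)
docs <- c("time will tell", "one time only", "will one know")

# Split every entry into words and count occurrences over the whole corpus
tokens <- unlist(strsplit(docs, " +"))
counts <- table(tokens)

freq <- data.frame(word = names(counts),
                   frequency = as.vector(counts),
                   stringsAsFactors = FALSE)

# Most frequent first; ties are broken alphabetically
freq <- freq[order(-freq$frequency, freq$word), ]
head(freq, 3)
```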

The top 10 words with the highest occurrence frequencies in each of the three datasets are

kbl(cbind(corpusblogsfreq1g[1:10,],corpusnewsfreq1g[1:10,],corpustwitsfreq1g[1:10,])) %>%
        kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))
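| Blogs word | Blogs frequency | News word | News frequency | Twitter word | Twitter frequency |
|---|---|---|---|---|---|
| one | 3726 | said | 6156 | just | 3191 |
| will | 3150 | year | 3166 | get | 3063 |
| time | 3025 | will | 2764 | thank | 2818 |
| like | 3001 | one | 2262 | like | 2800 |
| just | 2758 | new | 1792 | love | 2595 |
| can | 2757 | time | 1777 | day | 2321 |
| get | 2659 | two | 1604 | good | 2142 |
| make | 2305 | state | 1603 | will | 2041 |
| year | 2008 | like | 1524 | one | 1925 |
| know | 1952 | get | 1483 | can | 1898 |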

Word clouds showing the occurrence of the top 25 words in each of the datasets are shown below

par(mfrow=c(1,3))
with(corpusblogsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)

Bigrams

We will now consider the occurrence of bigrams, that is, pairs of words appearing together, which could play an important role in the prediction model. Due to memory constraints, we further reduce the samples drawn above to 5000 entries for each of the blogs and news datasets and to 10000 entries for the Twitter dataset:

blogs=sample(blogs,5000)
news=sample(news,5000)
twits=sample(twits,10000)

Now to create the corpora, we don’t remove stop words, as doing so could reduce the significance of bigrams. For the same reason, we do not reduce all the forms of a word to its root or stem.

corpusblogs=makecorpus(blogs)
corpusnews=makecorpus(news)
corpustwits=makecorpus(twits)

We apply the same technique to calculate the frequencies of bigrams as the one used before for unigrams.

token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

corpusblogsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusblogs,PlainTextDocument), control = list(tokenize = token2))))
corpusblogsfreq2g=data.frame(word=names(corpusblogsfreq2g), frequency=corpusblogsfreq2g,row.names = c())
corpusblogsfreq2g=arrange(corpusblogsfreq2g,desc(frequency))[1:100,]

corpusnewsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusnews,PlainTextDocument), control = list(tokenize = token2))))
corpusnewsfreq2g=data.frame(word=names(corpusnewsfreq2g), frequency=corpusnewsfreq2g,row.names = c())
corpusnewsfreq2g=arrange(corpusnewsfreq2g,desc(frequency))[1:100,]

corpustwitsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpustwits,PlainTextDocument), control = list(tokenize = token2))))
corpustwitsfreq2g=data.frame(word=names(corpustwitsfreq2g), frequency=corpustwitsfreq2g,row.names = c())
corpustwitsfreq2g=arrange(corpustwitsfreq2g,desc(frequency))[1:100,]
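What the n-gram tokenizer produces can be sketched in base R with a sliding window over the words of one entry. `make_ngrams` is a hypothetical helper written for illustration; it is not the RWeka implementation and only handles space-separated text.

```r
# Hypothetical helper: all runs of n consecutive words in one string
make_ngrams <- function(text, n) {
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]                  # drop empty tokens
  if (length(tokens) < n) return(character(0))      # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)         # window start positions
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams_demo <- make_ngrams("thanks for the follow", 2)
bigrams_demo
```

An entry with k words yields k - n + 1 n-grams, so short entries such as tweets contribute few of them.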

The top 10 bigrams with the highest occurrence frequencies in each of the three datasets are

kbl(cbind(corpusblogsfreq2g[1:10,],corpusnewsfreq2g[1:10,],corpustwitsfreq2g[1:10,])) %>%
        kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))
| Blogs word | Blogs frequency | News word | News frequency | Twitter word | Twitter frequency |
|---|---|---|---|---|---|
| of the | 1068 | of the | 979 | for the | 312 |
| in the | 829 | in the | 862 | in the | 309 |
| to the | 493 | to the | 412 | of the | 241 |
| on the | 416 | for the | 355 | on the | 206 |
| to be | 356 | on the | 343 | to be | 192 |
| for the | 337 | at the | 278 | to the | 182 |
| and i | 294 | in a | 273 | at the | 161 |
| and the | 287 | and the | 271 | thanks for | 160 |
| i was | 268 | to be | 226 | for a | 150 |
| at the | 259 | from the | 208 | thank you | 144 |

Word clouds showing the occurrence of the top 25 bigrams in each of the datasets are shown below

par(mfrow=c(1,3))
with(corpusblogsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)

Trigrams

We will now consider the occurrence of trigrams, that is, sequences of three words appearing together, which could also play an important role in the prediction model. We use the same corpora as for the bigram analysis and apply the same technique to calculate the trigram frequencies as the one used before for unigrams and bigrams.

token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

corpusblogsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusblogs,PlainTextDocument), control = list(tokenize = token3))))
corpusblogsfreq3g=data.frame(word=names(corpusblogsfreq3g), frequency=corpusblogsfreq3g,row.names = c())
corpusblogsfreq3g=arrange(corpusblogsfreq3g,desc(frequency))[1:100,]

corpusnewsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusnews,PlainTextDocument), control = list(tokenize = token3))))
corpusnewsfreq3g=data.frame(word=names(corpusnewsfreq3g), frequency=corpusnewsfreq3g,row.names = c())
corpusnewsfreq3g=arrange(corpusnewsfreq3g,desc(frequency))[1:100,]

corpustwitsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpustwits,PlainTextDocument), control = list(tokenize = token3))))
corpustwitsfreq3g=data.frame(word=names(corpustwitsfreq3g), frequency=corpustwitsfreq3g,row.names = c())
corpustwitsfreq3g=arrange(corpustwitsfreq3g,desc(frequency))[1:100,]

The top 10 trigrams with the highest occurrence frequencies in each of the three datasets are

kbl(cbind(corpusblogsfreq3g[1:10,],corpusnewsfreq3g[1:10,],corpustwitsfreq3g[1:10,])) %>%
        kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))
| Blogs word | Blogs frequency | News word | News frequency | Twitter word | Twitter frequency |
|---|---|---|---|---|---|
| one of the | 74 | one of the | 78 | thanks for the | 85 |
| a lot of | 69 | a lot of | 45 | thank you for | 43 |
| as well as | 45 | according to the | 35 | for the follow | 37 |
| some of the | 42 | to be a | 33 | looking forward to | 33 |
| i want to | 38 | as well as | 28 | i want to | 31 |
| the end of | 38 | some of the | 27 | a lot of | 27 |
| it was a | 37 | in the first | 24 | i love you | 27 |
| going to be | 36 | the end of | 24 | cant wait to | 26 |
| to be a | 35 | a year old | 23 | im going to | 26 |
| i wanted to | 34 | for the first | 22 | one of the | 26 |
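As a rough indication of how such a table might later feed the prediction model, the sketch below looks up the most frequent trigram that starts with a given two-word prefix and returns its last word. It is an illustration under stated assumptions, not the model this project will use: `predict_next` is a hypothetical helper, and the small table reuses four Twitter trigram counts reported above.

```r
# Four trigram counts taken from the Twitter column of the table above
trigrams <- data.frame(
  word = c("thanks for the", "thank you for", "i want to", "i love you"),
  frequency = c(85, 43, 31, 27),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent continuation of a two-word prefix
predict_next <- function(prefix, freq_table) {
  hits <- freq_table[startsWith(freq_table$word, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)    # prefix never seen
  best <- hits$word[which.max(hits$frequency)]
  sub(".* ", "", best)                          # keep only the final word
}

next_word <- predict_next("thanks for", trigrams)
next_word
```

A prefix that does not appear in the table returns `NA`, which is where a fallback to bigram or unigram frequencies could come in.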

Word clouds showing the occurrence of the top 25 trigrams in each of the datasets are shown below

par(mfrow=c(1,3))
with(corpusblogsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)