Milestone Report

Introduction

The goal of the capstone project is to build a Shiny application based on a predictive text model trained on corpus data. The user provides a word or short phrase and the application tries to predict the next word. Three English-language corpora collected from the Internet (news, blogs, and tweets) will be used to train the model.

The purpose of this report is to present some preprocessing and exploratory data analysis that has been applied to the data.

Libraries

The following libraries are used

library(tibble)
library(gdata)
library(stringr)
library(knitr)
library(lattice)
library(gridExtra)
library(tm)
library(RWeka)
library(dplyr)
library(kableExtra)
library(wordcloud)

Acquisition and Read of the Data

We download and unzip the dataset if it is not already present, and then read the three English files into memory

if (!file.exists("final")){
        zipfile="Coursera-SwiftKey.zip"
        if (!file.exists(zipfile)){
                dataurl="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
                download.file(dataurl,zipfile)
        }
        unzip(zipfile)
}

blogsfile="final/en_US/en_US.blogs.txt"
newsfile="final/en_US/en_US.news.txt"
twitterfile="final/en_US/en_US.twitter.txt"

blogs=readLines(blogsfile,skipNul=TRUE)
news=readLines(newsfile,skipNul=TRUE)
twits=readLines(twitterfile,skipNul=TRUE)

General Statistical Summary

We compute some basic summary statistics to get to know the datasets a little better

char=sapply(list(blogs,news,twits),nchar)
word=sapply(list(blogs,news,twits),function(x){str_count(x,pattern="\\w+")})
stats=tibble(corpus=c("blogs","news","twitter"),
        filesize=sapply(list(blogsfile,newsfile,twitterfile), function(x){humanReadable(file.info(x)$size)}),
        memorysize=sapply(list(blogs,news,twits), function(x){humanReadable(object.size(x))}),
        nlines=sapply(list(blogs,news,twits),length),
        totalchar=sapply(char,sum),meanchar=sapply(char,mean),medianchar=sapply(char,median),maxchar=sapply(char,max),
        totalword=sapply(word,sum),meanword=sapply(word,mean),medianword=sapply(word,median),maxword=sapply(word,max))
kable(stats,col.names=c("Corpus","File size", "Size in memory", "Number of entries","Total number of characters","Mean number of characters","Median number of characters","Maximum number of characters","Total number of words","Mean number of words","Median number of words","Maximum number of words"))

Corpus	File size	Size in memory	Number of entries	Total number of characters	Mean number of characters	Median number of characters	Maximum number of characters	Total number of words	Mean number of words	Median number of words	Maximum number of words
blogs	200.4 MiB	255.4 MiB	899288	206824505	229.98695	156	40833	38309620	42.59995	29	6851
news	196.3 MiB	257.3 MiB	1010242	203223159	201.16285	185	11384	35624454	35.26329	32	1928
twitter	159.4 MiB	319.0 MiB	2360148	162096241	68.68054	64	140	31003544	13.13627	12	47

Please note that total number of characters and total number of words are per corpus, whereas mean, median and maximum number of characters and words are per entry in each corpus.
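As a small illustration of the difference between per-corpus totals and per-entry summaries, the sketch below repeats the counting on a made-up vector using base R only. The `toy` vector and the `count_words` helper are hypothetical; the report itself relies on `str_count` from stringr.

```r
# Toy entries standing in for the lines of a corpus (hypothetical data)
toy <- c("Hello world", "A slightly longer entry, with punctuation!", "ok")

# Characters per entry, as nchar() does on the real datasets
toy_char <- nchar(toy)

# Words per entry: count the non-overlapping matches of \w+ in each entry
count_words <- function(x) {
  matches <- gregexpr("\\w+", x, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(matches, function(m) sum(m > 0), integer(1))
}
toy_word <- count_words(toy)

# Totals are per corpus; mean and maximum are per entry
toy_summary <- c(total_word = sum(toy_word),
                 mean_word = mean(toy_word),
                 max_word = max(toy_word))
print(toy_word)     # 2 6 1
print(toy_summary)  # total_word 9, mean_word 3, max_word 6
```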

We plot histograms of the number of characters and the number of words per entry for the blogs, news and Twitter datasets. Because these counts span several orders of magnitude in the blogs and news datasets, their horizontal axes use a base-10 logarithmic scale. This is not necessary for the Twitter dataset, where the character limit on posts (at most 140 characters in this data) keeps the range narrow.

histscal=list(x=list(at=c(0,1,2,3,4),label=c(1,10,100,1000,10000)))

plotcb=histogram(log10(char[[1]]),scales=histscal,main="Blog",xlab="Number of characters")
plotcn=histogram(log10(char[[2]]),scales=histscal,main="News",xlab="Number of characters")
plotct=histogram(char[[3]],main="Twitter",xlab="Number of characters")

plotwb=histogram(log10(word[[1]]),scales=histscal,main="Blog",xlab="Number of words")
plotwn=histogram(log10(word[[2]]),scales=histscal,main="News",xlab="Number of words")
plotwt=histogram(word[[3]],main="Twitter",xlab="Number of words")

grid.arrange(plotcb,plotwb,ncol=2)

grid.arrange(plotcn,plotwn,ncol=2)

grid.arrange(plotct,plotwt,ncol=2)
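To make the effect of the logarithmic axis concrete, here is a minimal base R sketch on invented entry lengths. The `entry_chars` values are hypothetical, and base `hist` with `plot = FALSE` is used instead of lattice so that only the bin counts are computed.

```r
# Hypothetical entry lengths spanning several orders of magnitude
entry_chars <- c(3, 12, 45, 150, 160, 900, 40000)

# Bin on the log10 scale: each bin covers one order of magnitude
h <- hist(log10(entry_chars), breaks = 0:5, plot = FALSE)
print(h$counts)  # 1 2 3 0 1

# Tick positions on the log10 axis and the labels shown to the reader,
# mirroring the at/label pairs defined in histscal
ticks <- 0:4
tick_labels <- 10^ticks
print(tick_labels)  # 1 10 100 1000 10000
```

On a linear axis the single entry of 40000 characters would squeeze every other value into the first bin, which is why the blogs and news histograms are drawn on the transformed scale.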

Corpus processing and analysis

We have seen above that we have 899288 blog entries, 1010242 news entries, and 2360148 Twitter entries. This is too many entries to process comfortably with limited computer memory. We will therefore work with a random sample of 25000 entries from each of the blogs and news datasets, and of 50000 entries from the Twitter dataset.

set.seed(20201130)
blogs=sample(blogs,25000)
news=sample(news,25000)
twits=sample(twits,50000)
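The role of `set.seed` can be checked with a short sketch: resetting the same seed before `sample` reproduces the same draw. The `entries` vector below is a hypothetical stand-in for one of the datasets.

```r
# Hypothetical stand-in for a large character vector of entries
entries <- paste("entry", 1:1000)

set.seed(20201130)                  # same seed as in the report
first_draw <- sample(entries, 25)

set.seed(20201130)                  # resetting the seed reproduces the draw
second_draw <- sample(entries, 25)

print(identical(first_draw, second_draw))  # TRUE
print(length(unique(first_draw)))          # 25, since sample() draws without replacement by default
```

Note that in the report the seed is set once, so the three `sample` calls must be run in the same order to reproduce the same samples.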

We will use the tm package to convert the three datasets into proper corpora and to clean and preprocess the data. We will use two functions for this process. With the first, we

Convert hyphens to spaces to prevent word joining when removing punctuation later.
Remove punctuation
Remove numbers
Convert everything to lowercase
Remove extra blank spaces (in case there are more than one in a row)

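makecorpus <- function(x){
        corpus=VCorpus(VectorSource(x))
        # content_transformer keeps each element a text document when applying plain string functions
        corpus=tm_map(corpus, content_transformer(function(y) gsub("-", " ", y)))
        corpus=tm_map(corpus, removePunctuation)
        corpus=tm_map(corpus, removeNumbers)
        corpus=tm_map(corpus, content_transformer(tolower))
        corpus=tm_map(corpus, stripWhitespace)
        corpus  # return the cleaned corpus explicitly
}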
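The order of the cleaning steps matters, which can be seen on a single string. The following is a sketch in base R, not the tm pipeline itself: `clean_text` is a hypothetical helper that mimics the steps listed above with regular expressions, and it additionally trims leading and trailing spaces, which `stripWhitespace` does not do.

```r
# Hypothetical base R counterpart of the cleaning steps
clean_text <- function(x) {
  x <- gsub("-", " ", x, fixed = TRUE)   # hyphens to spaces first
  x <- gsub("[[:punct:]]+", "", x)       # then remove punctuation
  x <- gsub("[[:digit:]]+", "", x)       # remove numbers
  x <- tolower(x)                        # lowercase
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  trimws(x)                              # extra step, not part of makecorpus
}

cleaned <- clean_text("State-of-the-art results, in 2020!")
print(cleaned)  # "state of the art results in"

# Without the hyphen step, removing punctuation joins the words together
joined <- gsub("[[:punct:]]+", "", "State-of-the-art")
print(joined)   # "Stateoftheart"
```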

We will use the second function to

Remove stop words (these are grammatical words without proper lexical meaning, such as articles, prepositions, conjunctions, etc.)
Remove inflection, considering only the stem or root of each word.

processcorpus <- function(corpus){
        corpus=tm_map(corpus,removeWords,stopwords("english"))
        corpus=tm_map(corpus,stemDocument)
        corpus  # return the processed corpus explicitly
}
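Stop-word removal can be sketched without tm as a simple token filter. Both the short `stop_list` and the `remove_stop_words` helper below are hypothetical; the report uses the much longer list returned by `stopwords("english")`, and stemming is left out of this sketch because it needs a dedicated stemmer.

```r
# Hypothetical short stop list, only for illustration
stop_list <- c("the", "a", "of", "and", "in", "to")

# Split each entry into tokens, drop the stop words, and paste the rest back
remove_stop_words <- function(x, stops) {
  tokens <- strsplit(x, " ", fixed = TRUE)
  vapply(tokens,
         function(tok) paste(tok[!tok %in% stops], collapse = " "),
         character(1))
}

filtered <- remove_stop_words(c("the end of the day", "a lot of work"), stop_list)
print(filtered)  # "end day" "lot work"
```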

We apply the techniques consecutively to the three datasets

corpus2blogs=processcorpus(makecorpus(blogs))
corpus2news=processcorpus(makecorpus(news))
corpus2twits=processcorpus(makecorpus(twits))

Word frequencies

A first quantitative approach to any corpus is to look at how often each word occurs. We use the corpora from which stop words and inflectional endings have been removed, which gives a more realistic picture of the most common lexical terms. Using the tm and RWeka packages, we define a tokenizing function, build a term-document matrix for each dataset, and sum its rows to obtain the frequency of each term. We keep only the 100 most frequent terms.

token1 <- function(x){
        NGramTokenizer(x, Weka_control(min = 1, max = 1))
}

corpusblogsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2blogs,PlainTextDocument), control = list(tokenize = token1))))
corpusblogsfreq1g <- data.frame(word=names(corpusblogsfreq1g), frequency=corpusblogsfreq1g,row.names = c())
corpusblogsfreq1g=arrange(corpusblogsfreq1g,desc(frequency))[1:100,]

corpusnewsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2news,PlainTextDocument), control = list(tokenize = token1))))
corpusnewsfreq1g <- data.frame(word=names(corpusnewsfreq1g), frequency=corpusnewsfreq1g,row.names = c())
corpusnewsfreq1g=arrange(corpusnewsfreq1g,desc(frequency))[1:100,]

corpustwitsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2twits,PlainTextDocument), control = list(tokenize = token1))))
corpustwitsfreq1g <- data.frame(word=names(corpustwitsfreq1g), frequency=corpustwitsfreq1g,row.names = c())
corpustwitsfreq1g=arrange(corpustwitsfreq1g,desc(frequency))[1:100,]
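The same counting idea can be shown on a handful of documents with base R. This is only a sketch under simplifying assumptions: the `docs` vector is invented, tokens are split on single spaces, and `table` replaces the term-document matrix.

```r
# Hypothetical cleaned documents
docs <- c("just one day", "one more day", "just one")

# Pool all tokens and count how often each one occurs
tokens <- unlist(strsplit(docs, " ", fixed = TRUE))
counts <- table(tokens)

# Same layout as the frequency data frames in the report
freq <- data.frame(word = names(counts),
                   frequency = as.integer(counts),
                   stringsAsFactors = FALSE)

# Sort by decreasing frequency, breaking ties alphabetically
freq <- freq[order(-freq$frequency, freq$word), ]
print(freq$word)       # "one" "day" "just" "more"
print(freq$frequency)  # 3 2 2 1
```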

The top 10 words with the highest occurrence frequencies in each of the three datasets are

kbl(cbind(corpusblogsfreq1g[1:10,],corpusnewsfreq1g[1:10,],corpustwitsfreq1g[1:10,])) %>%
        kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))

Blogs		News		Twitter
word	frequency	word	frequency	word	frequency
one	3726	said	6156	just	3191
will	3150	year	3166	get	3063
time	3025	will	2764	thank	2818
like	3001	one	2262	like	2800
just	2758	new	1792	love	2595
can	2757	time	1777	day	2321
get	2659	two	1604	good	2142
make	2305	state	1603	will	2041
year	2008	like	1524	one	1925
know	1952	get	1483	can	1898

Word clouds showing the occurrence of the top 25 words in each of the datasets are shown below

par(mfrow=c(1,3))
# max.words (not maxwords) is the wordcloud argument that limits the number of words plotted
with(corpusblogsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)

Bigrams

We will now consider the occurrence of bigrams, that is, sequences of two consecutive words, which could play an important role in the prediction model. Due to memory constraints, we further reduce the samples to 5000 entries each for the blogs and news datasets and to 10000 entries for the Twitter dataset

blogs=sample(blogs,5000)
news=sample(news,5000)
twits=sample(twits,10000)

When creating these corpora, we do not remove stop words, as doing so could reduce the significance of the bigrams. For the same reason, we do not reduce the forms of a word to its root or stem.

corpusblogs=makecorpus(blogs)
corpusnews=makecorpus(news)
corpustwits=makecorpus(twits)

We apply the same technique to calculate the frequencies of bigrams as the one used before for unigrams.

token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

corpusblogsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusblogs,PlainTextDocument), control = list(tokenize = token2))))
corpusblogsfreq2g=data.frame(word=names(corpusblogsfreq2g), frequency=corpusblogsfreq2g,row.names = c())
corpusblogsfreq2g=arrange(corpusblogsfreq2g,desc(frequency))[1:100,]

corpusnewsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusnews,PlainTextDocument), control = list(tokenize = token2))))
corpusnewsfreq2g=data.frame(word=names(corpusnewsfreq2g), frequency=corpusnewsfreq2g,row.names = c())
corpusnewsfreq2g=arrange(corpusnewsfreq2g,desc(frequency))[1:100,]

corpustwitsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpustwits,PlainTextDocument), control = list(tokenize = token2))))
corpustwitsfreq2g=data.frame(word=names(corpustwitsfreq2g), frequency=corpustwitsfreq2g,row.names = c())
corpustwitsfreq2g=arrange(corpustwitsfreq2g,desc(frequency))[1:100,]
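To show what the tokenizer produces, the sketch below builds n-grams from a single cleaned entry in base R. The `ngrams` helper is hypothetical and splits on single spaces only; `NGramTokenizer` from RWeka is what the report actually uses.

```r
# Hypothetical helper: all sequences of n consecutive words in one entry
ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))  # entry too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- ngrams("thanks for the follow", 2)
print(bigrams)   # "thanks for" "for the" "the follow"

trigrams <- ngrams("thanks for the follow", 3)
print(trigrams)  # "thanks for the" "for the follow"
```

An entry of k words yields k - n + 1 n-grams, so entries shorter than n words contribute nothing.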

The top 10 bigrams with the highest occurrence frequencies in each of the three datasets are

kbl(cbind(corpusblogsfreq2g[1:10,],corpusnewsfreq2g[1:10,],corpustwitsfreq2g[1:10,])) %>%
        kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))

Blogs		News		Twitter
word	frequency	word	frequency	word	frequency
of the	1068	of the	979	for the	312
in the	829	in the	862	in the	309
to the	493	to the	412	of the	241
on the	416	for the	355	on the	206
to be	356	on the	343	to be	192
for the	337	at the	278	to the	182
and i	294	in a	273	at the	161
and the	287	and the	271	thanks for	160
i was	268	to be	226	for a	150
at the	259	from the	208	thank you	144
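As a rough indication of how such a table might feed the prediction model, the sketch below looks up the most frequent continuation of a given word. It is only one possible approach, not the model of the final application: `predict_next` is a hypothetical helper, and the small `bigram_freq` table reuses four Twitter rows from the table above.

```r
# Four Twitter bigrams taken from the table above
bigram_freq <- data.frame(
  word = c("for the", "for a", "thanks for", "thank you"),
  frequency = c(312, 150, 160, 144),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent second word among bigrams starting with `previous`
predict_next <- function(previous, freq_table) {
  parts <- strsplit(freq_table$word, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- first == previous
  if (!any(hits)) return(NA_character_)  # unseen word: no prediction
  second[hits][which.max(freq_table$frequency[hits])]
}

print(predict_next("for", bigram_freq))     # "the"
print(predict_next("thank", bigram_freq))   # "you"
print(predict_next("banana", bigram_freq))  # NA
```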

Word clouds showing the occurrence of the top 25 bigrams in each of the datasets are shown below

par(mfrow=c(1,3))
# max.words (not maxwords) is the wordcloud argument that limits the number of bigrams plotted
with(corpusblogsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)

Trigrams

We will now consider the occurrence of trigrams, that is, sequences of three consecutive words, which could also play an important role in the prediction model. We use the same corpora as for the bigram analysis and apply the same technique to calculate the frequencies of trigrams as the one used before for unigrams and bigrams.

token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

corpusblogsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusblogs,PlainTextDocument), control = list(tokenize = token3))))
corpusblogsfreq3g=data.frame(word=names(corpusblogsfreq3g), frequency=corpusblogsfreq3g,row.names = c())
corpusblogsfreq3g=arrange(corpusblogsfreq3g,desc(frequency))[1:100,]

corpusnewsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusnews,PlainTextDocument), control = list(tokenize = token3))))
corpusnewsfreq3g=data.frame(word=names(corpusnewsfreq3g), frequency=corpusnewsfreq3g,row.names = c())
corpusnewsfreq3g=arrange(corpusnewsfreq3g,desc(frequency))[1:100,]

corpustwitsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpustwits,PlainTextDocument), control = list(tokenize = token3))))
corpustwitsfreq3g=data.frame(word=names(corpustwitsfreq3g), frequency=corpustwitsfreq3g,row.names = c())
corpustwitsfreq3g=arrange(corpustwitsfreq3g,desc(frequency))[1:100,]

The top 10 trigrams with the highest occurrence frequencies in each of the three datasets are

kbl(cbind(corpusblogsfreq3g[1:10,],corpusnewsfreq3g[1:10,],corpustwitsfreq3g[1:10,])) %>%
        kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))

Blogs		News		Twitter
word	frequency	word	frequency	word	frequency
one of the	74	one of the	78	thanks for the	85
a lot of	69	a lot of	45	thank you for	43
as well as	45	according to the	35	for the follow	37
some of the	42	to be a	33	looking forward to	33
i want to	38	as well as	28	i want to	31
the end of	38	some of the	27	a lot of	27
it was a	37	in the first	24	i love you	27
going to be	36	the end of	24	cant wait to	26
to be a	35	a year old	23	im going to	26
i wanted to	34	for the first	22	one of the	26

Word clouds showing the occurrence of the top 25 trigrams in each of the datasets are shown below

par(mfrow=c(1,3))
# max.words (not maxwords) is the wordcloud argument that limits the number of trigrams plotted
with(corpusblogsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)