The goal of the capstone project is to build a Shiny application based on a predictive text model trained on corpus data. The user provides a word or short phrase and the application tries to predict the next word. Three English-language corpora collected from the Internet (news, blogs, and tweets) are used to train the model.
The purpose of this report is to present some preprocessing and exploratory data analysis that has been applied to the data.
The following libraries are used
library(tibble)
library(gdata)
library(stringr)
library(knitr)
library(lattice)
library(gridExtra)
library(tm)
library(RWeka)
library(dplyr)
library(kableExtra)
library(wordcloud)
We download and unzip the data if it is not already available locally, then read the three English-language files
if (!file.exists("final")){
zipfile="Coursera-SwiftKey.zip"
if (!file.exists(zipfile)){
dataurl="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(dataurl,zipfile)
}
unzip(zipfile)
}
blogsfile="final/en_US/en_US.blogs.txt"
newsfile="final/en_US/en_US.news.txt"
twitterfile="final/en_US/en_US.twitter.txt"
blogs=readLines(blogsfile,skipNul=TRUE)
news=readLines(newsfile,skipNul=TRUE)
twits=readLines(twitterfile,skipNul=TRUE)
We apply a basic statistical analysis to get to know the datasets a little bit better
char=sapply(list(blogs,news,twits),nchar)
word=sapply(list(blogs,news,twits),function(x){str_count(x,pattern="\\w+")})
stats=tibble(
  corpus=c("blogs","news","twitter"),
  filesize=sapply(list(blogsfile,newsfile,twitterfile), function(x){humanReadable(file.info(x)$size)}),
  memorysize=sapply(list(blogs,news,twits), function(x){humanReadable(object.size(x))}),
  nlines=sapply(list(blogs,news,twits),length),
  totalchar=sapply(char,sum),
  meanchar=sapply(char,mean),
  medianchar=sapply(char,median),
  maxchar=sapply(char,max),
  totalword=sapply(word,sum),
  meanword=sapply(word,mean),
  medianword=sapply(word,median),
  maxword=sapply(word,max))
kable(stats,col.names=c(
  "Corpus","File size", "Size in memory", "Number of entries",
  "Total number of characters","Mean number of characters","Median number of characters","Maximum number of characters",
  "Total number of words","Mean number of words","Median number of words","Maximum number of words"))
| Corpus | File size | Size in memory | Number of entries | Total number of characters | Mean number of characters | Median number of characters | Maximum number of characters | Total number of words | Mean number of words | Median number of words | Maximum number of words |
|---|---|---|---|---|---|---|---|---|---|---|---|
| blogs | 200.4 MiB | 255.4 MiB | 899288 | 206824505 | 229.98695 | 156 | 40833 | 38309620 | 42.59995 | 29 | 6851 |
| news | 196.3 MiB | 257.3 MiB | 1010242 | 203223159 | 201.16285 | 185 | 11384 | 35624454 | 35.26329 | 32 | 1928 |
| twitter | 159.4 MiB | 319.0 MiB | 2360148 | 162096241 | 68.68054 | 64 | 140 | 31003544 | 13.13627 | 12 | 47 |
Please note that total number of characters and total number of words are per corpus, whereas mean, median and maximum number of characters and words are per entry in each corpus.
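To make the distinction between corpus-level totals and per-entry summaries concrete, here is a minimal base R sketch on a toy vector. The three entries are hypothetical stand-ins for the real data, and base R's `regmatches`/`gregexpr` replace `str_count` purely so that the example needs no extra packages.

```r
# Toy corpus standing in for the real data (hypothetical entries).
entries <- c("The quick brown fox", "jumps over", "the lazy dog again and again")

# Per-entry counts: characters, and runs of word characters.
n_char <- nchar(entries)
n_word <- lengths(regmatches(entries, gregexpr("\\w+", entries)))

# Corpus-level totals next to per-entry summaries.
summary_stats <- c(
  total_char  = sum(n_char),
  mean_char   = mean(n_char),
  median_char = median(n_char),
  max_char    = max(n_char),
  total_word  = sum(n_word),
  max_word    = max(n_word)
)
print(summary_stats)
```

Here the totals (57 characters, 12 words) describe the whole toy corpus, while the mean, median and maximum describe a typical or extreme entry.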
We plot histograms of the number of characters and the number of words per entry for the blogs, news and Twitter datasets. Because entry lengths in the blogs and news datasets span several orders of magnitude, their histograms use a base-10 logarithmic x-axis. This is not necessary for the Twitter dataset, where the character limit on posts keeps entries short (at most 140 characters in this corpus).
histscal=list(x=list(at=c(0,1,2,3,4),label=c(1,10,100,1000,10000)))
plotcb=histogram(log10(char[[1]]),scales=histscal,main="Blog",xlab="Number of characters")
plotcn=histogram(log10(char[[2]]),scales=histscal,main="News",xlab="Number of characters")
plotct=histogram(char[[3]],main="Twitter",xlab="Number of characters")
plotwb=histogram(log10(word[[1]]),scales=histscal,main="Blog",xlab="Number of words")
plotwn=histogram(log10(word[[2]]),scales=histscal,main="News",xlab="Number of words")
plotwt=histogram(word[[3]],main="Twitter",xlab="Number of words")
grid.arrange(plotcb,plotwb,ncol=2)
grid.arrange(plotcn,plotwn,ncol=2)
grid.arrange(plotct,plotwt,ncol=2)
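The effect of the logarithmic axis can be illustrated without the corpus. The sketch below uses simulated, right-skewed lengths (an assumption, not the real data) and base R's `hist` with `plot = FALSE`, so it only compares how the observations spread over the bins on each scale.

```r
# Simulated, right-skewed entry lengths (hypothetical data, not the corpus).
set.seed(1)
len <- round(rlnorm(1000, meanlog = 5, sdlog = 1)) + 1

# Bin on the raw scale and on the log10 scale without drawing anything.
raw_hist <- hist(len, breaks = 20, plot = FALSE)
log_hist <- hist(log10(len), breaks = 20, plot = FALSE)

# Share of all entries that falls into the single fullest bin.
raw_peak <- max(raw_hist$counts) / length(len)
log_peak <- max(log_hist$counts) / length(len)
c(raw = raw_peak, log10 = log_peak)
```

On the raw scale most observations pile up in the first bin and the long tail is barely visible; on the log10 scale they are spread over many bins, which is why the blogs and news histograms are easier to read that way.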
As seen above, we have 899288 blog entries, 1010242 news entries, and 2360148 Twitter entries. That is a lot of entries, which will be difficult to process with limited computer memory. We will therefore limit the analysis to a random sample of 25000 entries each from the blogs and news datasets, and of 50000 entries from the Twitter dataset.
set.seed(20201130)
blogs=sample(blogs,25000)
news=sample(news,25000)
twits=sample(twits,50000)
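Fixing the seed is what makes this subsampling reproducible. A minimal sketch on a hypothetical vector of entries shows that the same seed and the same call give the same sample, and that `sample` draws without replacement by default.

```r
# Hypothetical stand-in for a large vector of text entries.
entries <- paste("entry", 1:1000)

# Same seed, same call: the two samples are identical.
set.seed(20201130)
first_draw <- sample(entries, 25)
set.seed(20201130)
second_draw <- sample(entries, 25)
identical(first_draw, second_draw)

# sample() draws without replacement by default, so no entry repeats.
anyDuplicated(first_draw) == 0
```

Note that the result depends on the order of the calls after `set.seed`: drawing the news sample before the blogs sample would give different subsets.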
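We will use the tm package to convert the three datasets into proper corpora and to clean and preprocess the data. We will use two functions for this process. With the first, we build the corpus, replace hyphens with spaces, remove punctuation and numbers, convert the text to lower case, and collapse repeated whitespace.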
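makecorpus <- function(x){
# Build a volatile corpus with one document per entry
corpus=VCorpus(VectorSource(x))
# content_transformer keeps each document a text document after a plain string function is applied
corpus=tm_map(corpus, content_transformer(function(y) gsub("-", " ", y)))
corpus=tm_map(corpus, removePunctuation)
corpus=tm_map(corpus, removeNumbers)
corpus=tm_map(corpus, content_transformer(tolower))
corpus=tm_map(corpus, stripWhitespace)
corpus
}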
We will use the second function to remove English stop words and reduce the remaining words to their stems.
processcorpus <- function(corpus){
# Drop common English function words, then stem what is left
corpus=tm_map(corpus,removeWords,stopwords("english"))
corpus=tm_map(corpus,stemDocument)
corpus
}
We apply the techniques consecutively to the three datasets
corpus2blogs=processcorpus(makecorpus(blogs))
corpus2news=processcorpus(makecorpus(news))
corpus2twits=processcorpus(makecorpus(twits))
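To see what the cleaning steps do to a single entry, here is a base R approximation applied to one string. It is only a sketch: the report itself uses the tm transformations, `clean_text` is a hypothetical helper, and the final `trimws` call is an extra step that `makecorpus` does not perform.

```r
# Base R approximation of the cleaning steps (the report itself uses tm).
clean_text <- function(x) {
  x <- gsub("-", " ", x, fixed = TRUE)   # split hyphenated words
  x <- gsub("[[:punct:]]", "", x)        # drop punctuation
  x <- gsub("[[:digit:]]", "", x)        # drop numbers
  x <- tolower(x)                        # normalise case
  x <- gsub("\\s+", " ", x)              # collapse runs of whitespace
  trimws(x)                              # trim the ends (not done in makecorpus)
}

clean_text("Well-known fact: I can't wait for 2021!!")
# "well known fact i cant wait for"
```

Removing punctuation also removes apostrophes, which is why contractions show up as single tokens such as "cant" and "im" in the frequency tables of this report.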
One of the first approaches to studying any corpus quantitatively is to look at the occurrence frequencies of words. We will consider the corpora after removing stop words and inflectional endings, which gives a more realistic picture of which lexical terms are the most common. Using the tm and RWeka packages, we create a tokenizing function, build a term-document matrix for each dataset, and sum its rows to obtain the frequency of each term. We then keep the 100 most frequent terms.
token1 <- function(x){
NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
corpusblogsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2blogs,PlainTextDocument), control = list(tokenize = token1))))
corpusblogsfreq1g <- data.frame(word=names(corpusblogsfreq1g), frequency=corpusblogsfreq1g,row.names = c())
corpusblogsfreq1g=arrange(corpusblogsfreq1g,desc(frequency))[1:100,]
corpusnewsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2news,PlainTextDocument), control = list(tokenize = token1))))
corpusnewsfreq1g <- data.frame(word=names(corpusnewsfreq1g), frequency=corpusnewsfreq1g,row.names = c())
corpusnewsfreq1g=arrange(corpusnewsfreq1g,desc(frequency))[1:100,]
corpustwitsfreq1g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpus2twits,PlainTextDocument), control = list(tokenize = token1))))
corpustwitsfreq1g <- data.frame(word=names(corpustwitsfreq1g), frequency=corpustwitsfreq1g,row.names = c())
corpustwitsfreq1g=arrange(corpustwitsfreq1g,desc(frequency))[1:100,]
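The idea behind these frequency tables can be sketched in base R on a handful of hypothetical, already cleaned entries. This is not the tm/RWeka pipeline used above, only an illustration of counting tokens and sorting by frequency.

```r
# Hypothetical cleaned entries; the report builds these with tm instead.
docs <- c("time flies like an arrow", "fruit flies like a banana", "time will tell")

# Split every entry on whitespace and pool the tokens.
tokens <- unlist(strsplit(docs, "\\s+"))

# Count occurrences and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)
unigrams <- data.frame(word = names(freq), frequency = as.integer(freq))
head(unigrams, 3)
```

Summing the rows of a term-document matrix gives the same kind of result: one total count per term, pooled over all documents.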
The top 10 words with the highest occurrence frequencies in each of the three datasets are
kbl(cbind(corpusblogsfreq1g[1:10,],corpusnewsfreq1g[1:10,],corpustwitsfreq1g[1:10,])) %>%
kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))
| Blogs: word | Blogs: frequency | News: word | News: frequency | Twitter: word | Twitter: frequency |
|---|---|---|---|---|---|
| one | 3726 | said | 6156 | just | 3191 |
| will | 3150 | year | 3166 | get | 3063 |
| time | 3025 | will | 2764 | thank | 2818 |
| like | 3001 | one | 2262 | like | 2800 |
| just | 2758 | new | 1792 | love | 2595 |
| can | 2757 | time | 1777 | day | 2321 |
| get | 2659 | two | 1604 | good | 2142 |
| make | 2305 | state | 1603 | will | 2041 |
| year | 2008 | like | 1524 | one | 1925 |
| know | 1952 | get | 1483 | can | 1898 |
Word clouds showing the occurrence of the top 25 words in each of the datasets are shown below
par(mfrow=c(1,3))
with(corpusblogsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq1g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)
We will now consider the occurrence of bigrams, that is, two words appearing together, which could play an important role in the prediction model. Due to memory constraints, we further reduce the datasets to 5000 entries each for blogs and news and to 10000 entries for the Twitter dataset
blogs=sample(blogs,5000)
news=sample(news,5000)
twits=sample(twits,10000)
Now to create the corpora, we don’t remove stop words, as doing so could reduce the significance of bigrams. For the same reason, we do not reduce all the forms of a word to its root or stem.
corpusblogs=makecorpus(blogs)
corpusnews=makecorpus(news)
corpustwits=makecorpus(twits)
We apply the same technique to calculate the frequencies of bigrams as the one used before for unigrams.
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpusblogsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusblogs,PlainTextDocument), control = list(tokenize = token2))))
corpusblogsfreq2g=data.frame(word=names(corpusblogsfreq2g), frequency=corpusblogsfreq2g,row.names = c())
corpusblogsfreq2g=arrange(corpusblogsfreq2g,desc(frequency))[1:100,]
corpusnewsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusnews,PlainTextDocument), control = list(tokenize = token2))))
corpusnewsfreq2g=data.frame(word=names(corpusnewsfreq2g), frequency=corpusnewsfreq2g,row.names = c())
corpusnewsfreq2g=arrange(corpusnewsfreq2g,desc(frequency))[1:100,]
corpustwitsfreq2g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpustwits,PlainTextDocument), control = list(tokenize = token2))))
corpustwitsfreq2g=data.frame(word=names(corpustwitsfreq2g), frequency=corpustwitsfreq2g,row.names = c())
corpustwitsfreq2g=arrange(corpustwitsfreq2g,desc(frequency))[1:100,]
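What an n-gram tokenizer produces can be sketched in a few lines of base R. The helper `make_ngrams` below is hypothetical and is not the RWeka implementation; it only illustrates sliding a window of `n` tokens over each entry and counting the resulting phrases.

```r
# Build all n-grams of one tokenised entry (hypothetical helper, base R only).
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

docs <- c("thanks for the follow", "thanks for the love")
token_list <- strsplit(docs, " ", fixed = TRUE)

# N-grams are built per entry, so they never span two entries.
bigrams <- unlist(lapply(token_list, make_ngrams, n = 2))
sort(table(bigrams), decreasing = TRUE)
```

Calling the same helper with `n = 3` yields trigrams, and an entry shorter than `n` tokens contributes nothing.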
The top 10 bigrams with the highest occurrence frequencies in each of the three datasets are
kbl(cbind(corpusblogsfreq2g[1:10,],corpusnewsfreq2g[1:10,],corpustwitsfreq2g[1:10,])) %>%
kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))
| Blogs: word | Blogs: frequency | News: word | News: frequency | Twitter: word | Twitter: frequency |
|---|---|---|---|---|---|
| of the | 1068 | of the | 979 | for the | 312 |
| in the | 829 | in the | 862 | in the | 309 |
| to the | 493 | to the | 412 | of the | 241 |
| on the | 416 | for the | 355 | on the | 206 |
| to be | 356 | on the | 343 | to be | 192 |
| for the | 337 | at the | 278 | to the | 182 |
| and i | 294 | in a | 273 | at the | 161 |
| and the | 287 | and the | 271 | thanks for | 160 |
| i was | 268 | to be | 226 | for a | 150 |
| at the | 259 | from the | 208 | thank you | 144 |
Word clouds showing the occurrence of the top 25 bigrams in each of the datasets are shown below
par(mfrow=c(1,3))
with(corpusblogsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq2g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)
We will now consider the occurrence of trigrams, that is, three words appearing together, which could also play an important role in the prediction model. We use the same corpora as for the bigram analysis and apply the same technique as before to calculate the trigram frequencies.
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpusblogsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusblogs,PlainTextDocument), control = list(tokenize = token3))))
corpusblogsfreq3g=data.frame(word=names(corpusblogsfreq3g), frequency=corpusblogsfreq3g,row.names = c())
corpusblogsfreq3g=arrange(corpusblogsfreq3g,desc(frequency))[1:100,]
corpusnewsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpusnews,PlainTextDocument), control = list(tokenize = token3))))
corpusnewsfreq3g=data.frame(word=names(corpusnewsfreq3g), frequency=corpusnewsfreq3g,row.names = c())
corpusnewsfreq3g=arrange(corpusnewsfreq3g,desc(frequency))[1:100,]
corpustwitsfreq3g=rowSums(as.matrix(TermDocumentMatrix(tm_map(corpustwits,PlainTextDocument), control = list(tokenize = token3))))
corpustwitsfreq3g=data.frame(word=names(corpustwitsfreq3g), frequency=corpustwitsfreq3g,row.names = c())
corpustwitsfreq3g=arrange(corpustwitsfreq3g,desc(frequency))[1:100,]
The top 10 trigrams with the highest occurrence frequencies in each of the three datasets are
kbl(cbind(corpusblogsfreq3g[1:10,],corpusnewsfreq3g[1:10,],corpustwitsfreq3g[1:10,])) %>%
kable_classic(full_width=FALSE) %>% add_header_above(c("Blogs"=2,"News"=2,"Twitter"=2))
| Blogs: word | Blogs: frequency | News: word | News: frequency | Twitter: word | Twitter: frequency |
|---|---|---|---|---|---|
| one of the | 74 | one of the | 78 | thanks for the | 85 |
| a lot of | 69 | a lot of | 45 | thank you for | 43 |
| as well as | 45 | according to the | 35 | for the follow | 37 |
| some of the | 42 | to be a | 33 | looking forward to | 33 |
| i want to | 38 | as well as | 28 | i want to | 31 |
| the end of | 38 | some of the | 27 | a lot of | 27 |
| it was a | 37 | in the first | 24 | i love you | 27 |
| going to be | 36 | the end of | 24 | cant wait to | 26 |
| to be a | 35 | a year old | 23 | im going to | 26 |
| i wanted to | 34 | for the first | 22 | one of the | 26 |
Word clouds showing the occurrence of the top 25 trigrams in each of the datasets are shown below
par(mfrow=c(1,3))
with(corpusblogsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Blogs",side=3,cex=2)
with(corpusnewsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("News",side=3,cex=2)
with(corpustwitsfreq3g,wordcloud(word,frequency,random.order = FALSE,max.words=25,rot.per=0,fixed.asp=TRUE,use.r.layout=FALSE,scale=c(4,1),colors=brewer.pal(8,"Dark2")))
mtext("Twitter",side=3,cex=2)
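As a rough illustration of how such n-gram frequencies could feed the prediction model, the sketch below looks up the most frequent bigram that starts with a given word. It is only one possible approach under simple assumptions, not the final model: `predict_next` is a hypothetical helper, and the four counts are copied from the Twitter column of the bigram table in this report.

```r
# Bigram counts taken from the Twitter column of the bigram table.
bigram_freq <- data.frame(
  word = c("thanks for", "thank you", "for the", "for a"),
  frequency = c(160, 144, 312, 150),
  stringsAsFactors = FALSE
)

# Suggest the most frequent continuation of `previous`, or NA if unseen.
predict_next <- function(previous, freq_table) {
  prefix <- paste0(previous, " ")
  hits <- freq_table[startsWith(freq_table$word, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$word[which.max(hits$frequency)]
  substring(best, nchar(prefix) + 1)
}

predict_next("for", bigram_freq)    # "the"
predict_next("zebra", bigram_freq)  # NA
```

A word that never appears as the first half of a bigram returns `NA`, so a practical model would presumably need a fallback, for example to the most frequent unigrams.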