Introduction

This is a milestone report for the Coursera Data Science Capstone Project. I perform some exploratory analysis of the three data sets in question and report some interesting findings.

Some Exploratory Analysis

We first read in the three text files:

blogs<-readLines("Swift/final/en_US/en_US.blogs.txt", encoding="UTF-8")
con<-file("Swift/final/en_US/en_US.news.txt", open="rb")
news<-readLines(con, encoding="UTF-8")
close(con)
rm(con)
twitter<-readLines("Swift/final/en_US/en_US.twitter.txt", encoding="UTF-8")

To get an idea of the size of the respective files, we compute:

blogs_size<-file.info("Swift/final/en_US/en_US.blogs.txt")$size/1024/1024
news_size<-file.info("Swift/final/en_US/en_US.news.txt")$size/1024/1024
twitter_size<-file.info("Swift/final/en_US/en_US.twitter.txt")$size/1024/1024

The following statements retrieve the number of lines in each text:

blogs_lineNum<-length(blogs)
news_lineNum<-length(news)
twitter_lineNum<-length(twitter)

With this, I intend to get the respective number of words by counting the tokens separated by whitespace:

blogs_wCount<-sum(sapply(gregexpr("\\S+",blogs),length))
news_wCount<-sum(sapply(gregexpr("\\S+",news),length))
twitter_wCount<-sum(sapply(gregexpr("\\S+",twitter),length))

This leads to the following results: the “blogs” file is about 200.42 MB in size and contains 899288 lines and about 37334131 words; the “news” file is about 196.28 MB and contains 1010242 lines and about 34372530 words; the “twitter” file is about 159.36 MB and contains 2360148 lines and around 30373543 words.
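For convenience, these figures can be collected in one small summary table (a sketch using the variables computed above; the object name file_summary is just for illustration):

file_summary<-data.frame(file=c("blogs","news","twitter"),
                         size_MB=c(blogs_size,news_size,twitter_size),
                         lines=c(blogs_lineNum,news_lineNum,twitter_lineNum),
                         words=c(blogs_wCount,news_wCount,twitter_wCount))
file_summary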

Creating a Sample

Because the original files are quite large, I create a smaller sample containing only about 1% of the lines of each data set:

sample_blogs<-blogs[sample(1:length(blogs),0.01*length(blogs))]
sample_news<-news[sample(1:length(news),0.01*length(news))]
sample_twitter<-twitter[sample(1:length(twitter),0.01*length(twitter))]

sample_text<-c(sample_blogs,sample_news,sample_twitter)
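Since sample() draws randomly, the exact content of this sample differs from run to run; fixing a seed beforehand would make it reproducible (a sketch, the seed value is arbitrary):

## fixing the random seed before the sampling calls above makes the 1% sample reproducible
set.seed(1234)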

Processing the data (including some cleaning):

library(tm)
## some cleaning
sample_text<-iconv(sample_text,"latin1","ASCII",sub="")
sample_text<-VectorSource(sample_text)
sample_text<-Corpus(sample_text)
sample_text<-tm_map(sample_text,tolower)
sample_text<-tm_map(sample_text, removePunctuation)
sample_text<-tm_map(sample_text, removeNumbers)
sample_text<-tm_map(sample_text,stripWhitespace)
sample_text<-tm_map(sample_text,PlainTextDocument)
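A side note on the cleaning pipeline: with newer versions of tm it is usually recommended to wrap base functions such as tolower in content_transformer(), which keeps the document class intact and makes the PlainTextDocument step unnecessary. A sketch of the alternative call:

## alternative that preserves the corpus structure (tm >= 0.6)
sample_text<-tm_map(sample_text,content_transformer(tolower))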

Visualizing the data

A nice tool for visualizing word frequencies is the wordcloud package. The result looks like this:

library(wordcloud)
wordcloud(sample_text,max.words=120)

The word cloud gives a nice impression of the distribution of words in the text corpus, although it is not very precise.

Further examination of the data

In the next step I work with percentages: I calculate the percentage of occurrences of each word in the sample text, starting with the most frequent word (the words are arranged in descending order of frequency), and I also calculate the cumulative percentage. Due to the limited capacity of my computer, it was not possible to use the whole sample for this step, so I used a reduced subset to make the idea (and the corresponding algorithm) clear:

library(tm)
a<-sample_text[1:900]
b<-sample_text[21300:22200]
c<-sample_text[38300:39200]
sample_text2<-c(a,b,c)


sample_text2<-iconv(sample_text2,"latin1","ASCII",sub="")
sample_text2<-VectorSource(sample_text2)
sample_text2<-Corpus(sample_text2)
sample_text2<-tm_map(sample_text2,tolower)
sample_text2<-tm_map(sample_text2, removePunctuation)
sample_text2<-tm_map(sample_text2, removeNumbers)
sample_text2<-tm_map(sample_text2,stripWhitespace)
sample_text2<-tm_map(sample_text2,PlainTextDocument)
library(tm)
library(dplyr)
dtm<-DocumentTermMatrix(sample_text2)
dtm<-as.matrix(dtm)
freq<-colSums(dtm)
freq<-sort(freq,decreasing = TRUE)
cum<-as.data.frame(freq)
cum$perc<-freq/sum(freq)*100
cum$cum<-cumsum(cum$perc)
head(cum,25)
##       freq      perc       cum
## the   2581 5.6621986  5.662199
## and   1365 2.9945374  8.656736
## you    744 1.6321874 10.288924
## for    631 1.3842880 11.673212
## that   604 1.3250554 12.998267
## with   407 0.8928767 13.891144
## this   388 0.8511945 14.742338
## have   342 0.7502797 15.492618
## was    325 0.7129851 16.205603
## are    320 0.7020161 16.907619
## but    316 0.6932409 17.600860
## all    261 0.5725819 18.173442
## not    249 0.5462563 18.719698
## just   243 0.5330935 19.252792
## they   217 0.4760547 19.728846
## what   214 0.4694733 20.198320
## your   214 0.4694733 20.667793
## from   212 0.4650857 21.132878
## its    211 0.4628919 21.595770
## out    207 0.4541167 22.049887
## one    187 0.4102407 22.460128
## will   185 0.4058531 22.865981
## like   184 0.4036593 23.269640
## about  176 0.3861089 23.655749
## can    174 0.3817213 24.037470
nwords<-nrow(cum)
cum1<-mutate(cum,ind=1:nwords)

Only about 240 different words are needed to cover 50% of the text corpus:

cum1[237:242,]
##     freq       perc      cum ind
## 237   24 0.05265121 49.93528 237
## 238   24 0.05265121 49.98793 238
## 239   24 0.05265121 50.04059 239
## 240   24 0.05265121 50.09324 240
## 241   24 0.05265121 50.14589 241
## 242   24 0.05265121 50.19854 242
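
The same threshold can also be found programmatically (a sketch based on the cum1 data frame built above):

## number of distinct words needed for 50% and 90% cumulative coverage
min(which(cum1$cum>=50))
min(which(cum1$cum>=90))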

We can also visualize the most frequent words in a barplot:

barplot(cum[1:25,2], names.arg=rownames(cum[1:25,]),las=3)

N-gram Tokenization

Now we build n-grams systematically using the RWeka package. The idea is that by exploring which word combinations occur together frequently, we obtain a basis for the prediction algorithm to be developed.

Unigrams

library(RWeka)
unigram<-NGramTokenizer(sample_text2,Weka_control(min=1,max=1))
unigram1<-as.data.frame(table(unigram))
uni<-arrange(unigram1,desc(Freq))
uni$perc<-uni$Freq/sum(uni$Freq)*100
uni$cum<-cumsum(uni$perc)
uni[1:30,]
##    unigram Freq      perc       cum
## 1      the 2581 4.4048127  4.404813
## 2       to 1668 2.8466593  7.251472
## 3      and 1365 2.3295503  9.581022
## 4        a 1351 2.3056575 11.886680
## 5        i 1259 2.1486475 14.035327
## 6       of 1111 1.8960662 15.931393
## 7       in  895 1.5274341 17.458828
## 8      you  744 1.2697329 18.728560
## 9       is  735 1.2543732 19.982934
## 10      it  687 1.1724550 21.155389
## 11     for  631 1.0768837 22.232272
## 12    that  604 1.0308047 23.263077
## 13      my  511 0.8720881 24.135165
## 14      on  485 0.8277157 24.962881
## 15    with  407 0.6945985 25.657479
## 16    this  388 0.6621725 26.319652
## 17    have  342 0.5836675 26.903319
## 18      at  326 0.5563615 27.459681
## 19     was  325 0.5546548 28.014336
## 20     are  320 0.5461217 28.560457
## 21     but  316 0.5392952 29.099753
## 22      be  315 0.5375885 29.637341
## 23      as  270 0.4607902 30.098131
## 24      we  270 0.4607902 30.558921
## 25     all  261 0.4454305 31.004352
## 26      me  256 0.4368973 31.441249
## 27     not  249 0.4249509 31.866200
## 28      so  248 0.4232443 32.289444
## 29    just  243 0.4147112 32.704156
## 30    they  217 0.3703388 33.074494

What caught my eye is that this output differs from the one created via the DocumentTermMatrix: the DocumentTermMatrix apparently excludes words with fewer than three letters by default (its wordLengths control option).
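
If those short words should be included, the default can be overridden via the control argument (a sketch; the object name dtm_all is just for illustration):

## keep words of any length in the document-term matrix
dtm_all<-DocumentTermMatrix(sample_text2,control=list(wordLengths=c(1,Inf)))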

Bigrams

bigram<-NGramTokenizer(sample_text2,Weka_control(min=2,max=2))
bigram1<-as.data.frame(table(bigram))
bi<-arrange(bigram1,desc(Freq))
bi[1:20,]
##      bigram Freq
## 1    in the  238
## 2    of the  234
## 3    on the  121
## 4    to the  117
## 5   for the  112
## 6     to be  100
## 7      i am   75
## 8   and the   73
## 9      in a   73
## 10   i have   71
## 11 with the   71
## 12   at the   70
## 13    i was   69
## 14    it is   67
## 15    and i   63
## 16   if you   62
## 17    for a   61
## 18     is a   57
## 19 going to   55
## 20 from the   53

Trigrams

trigram<-NGramTokenizer(sample_text2,Weka_control(min=3,max=3))
trigram1<-as.data.frame(table(trigram))
tri<-arrange(trigram1,desc(Freq))
tri[1:20,]
##           trigram Freq
## 1      one of the   21
## 2  thanks for the   19
## 3        a lot of   16
## 4     is going to   15
## 5      out of the   15
## 6      the end of   12
## 7     going to be   11
## 8        i have a   11
## 9       i need to   11
## 10   in the world   11
## 11       i am not   10
## 12  the fact that   10
## 13        to be a   10
## 14       to go to   10
## 15     all of the    9
## 16     be able to    9
## 17   cant wait to    9
## 18      i have to    9
## 19    some of the    9
## 20     there is a    9

Next steps

Since the aim of this project is to build a Shiny app that “predicts” the next word based on the beginning of a sentence entered by the user, I will go deeper into the subject of NLP. I will think about which algorithms are suitable for the project's task. The general idea is to work with conditional probabilities: depending on which word sequence (“n-gram”) was entered, the probability of the next word is calculated.
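
As a first illustration of this idea, candidate next words can be ranked by their relative bigram frequency (a rough sketch based on the bi data frame built above; the function name predict_next is just for illustration, and smoothing and backoff are ignored):

## rank candidates by count(word, next) / count(word, .) taken from the bigram table
predict_next<-function(word,bi,n=3){
  cand<-bi[grepl(paste0("^",word," "),bi$bigram),]
  cand$prob<-cand$Freq/sum(cand$Freq)
  head(cand[order(-cand$prob),],n)
}
predict_next("going",bi)   # "going to" should come out on top in this sample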