This is a milestone report for the Coursera Data Science Capstone project. In it I carry out some exploratory analysis of the three data sets in question and report some of the more interesting findings.
We first read in the three text files:
blogs<-readLines("Swift/final/en_US/en_US.blogs.txt", encoding="UTF-8")
con<-file("Swift/final/en_US/en_US.news.txt", open="rb")
news<-readLines(con, encoding="UTF-8")
close(con)
rm(con)
twitter<-readLines("Swift/final/en_US/en_US.twitter.txt", encoding="UTF-8")
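Note: on some systems readLines() issues warnings about embedded nul characters for the Twitter file. If that happens, the skipNul argument of readLines() can be set; this is a possible adjustment and was not needed in the run reported here:
twitter<-readLines("Swift/final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)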
To get an idea of the size of the respective files (in megabytes), we compute:
blogs_size<-file.info("Swift/final/en_US/en_US.blogs.txt")$size/1024/1024
news_size<-file.info("Swift/final/en_US/en_US.news.txt")$size/1024/1024
twitter_size<-file.info("Swift/final/en_US/en_US.twitter.txt")$size/1024/1024
The following statements retrieve the number of lines in each text:
blogs_lineNum<-length(blogs)
news_lineNum<-length(news)
twitter_lineNum<-length(twitter)
From this I intend to obtain the respective word counts by counting the non-whitespace tokens, i.e. the character sequences separated by spaces:
blogs_wCount<-sum(sapply(gregexpr("\\S+",blogs),length))
news_wCount<-sum(sapply(gregexpr("\\S+",news),length))
twitter_wCount<-sum(sapply(gregexpr("\\S+",twitter),length))
This leads to the following results: the “blogs” file is about 200.42 MB in size, contains 899,288 lines and about 37,334,131 words. The “news” file is about 196.28 MB, with 1,010,242 lines and about 34,372,530 words. The “twitter” file is about 159.36 MB, with 2,360,148 lines and around 30,373,543 words.
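For easier reference, these figures can be collected in a small summary data frame. This is only a sketch that reuses the variables defined above; corpus_summary is just a name chosen here:
corpus_summary<-data.frame(file=c("blogs","news","twitter"),
                           size_MB=c(blogs_size,news_size,twitter_size),
                           lines=c(blogs_lineNum,news_lineNum,twitter_lineNum),
                           words=c(blogs_wCount,news_wCount,twitter_wCount))
corpus_summary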
Because of the sheer size of these files, I create a smaller sample containing only about 1% of the original data sets:
sample_blogs<-blogs[sample(1:length(blogs),0.01*length(blogs))]
sample_news<-news[sample(1:length(news),0.01*length(news))]
sample_twitter<-twitter[sample(1:length(twitter),0.01*length(twitter))]
sample_text<-c(sample_blogs,sample_news,sample_twitter)
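Note that sample() draws at random, so the exact sample (and thus the counts below) can vary slightly between runs. For a fully reproducible report one could fix the random seed before the three sample() calls above, for example:
set.seed(1234)   # hypothetical seed value; would have to be run before the sampling above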
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
## some cleaning
sample_text<-iconv(sample_text,"latin1","ASCII",sub="")
sample_text<-VectorSource(sample_text)
sample_text<-Corpus(sample_text)
sample_text<-tm_map(sample_text,tolower)
sample_text<-tm_map(sample_text, removePunctuation)
sample_text<-tm_map(sample_text, removeNumbers)
sample_text<-tm_map(sample_text,stripWhitespace)
sample_text<-tm_map(sample_text,PlainTextDocument)
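Since the same cleaning steps are applied again to a second, smaller corpus further below, they could also be bundled into a small helper function. This is only a sketch; clean_corpus is my own name and the function simply mirrors the pipeline above:
clean_corpus<-function(x){
  x<-iconv(x,"latin1","ASCII",sub="")   # drop non-ASCII characters
  x<-Corpus(VectorSource(x))            # build a tm corpus
  x<-tm_map(x,tolower)                  # lower-case everything
  x<-tm_map(x,removePunctuation)        # remove punctuation
  x<-tm_map(x,removeNumbers)            # remove digits
  x<-tm_map(x,stripWhitespace)          # collapse repeated whitespace
  tm_map(x,PlainTextDocument)           # coerce back to plain text documents
}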
A nice tool for visualizing word frequencies is the “wordcloud” package. The result looks like this:
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.5
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 3.2.3
wordcloud(sample_text,max.words=120)
The word cloud gives a good first impression of the distribution of words in the text corpus, although it is not very precise.
In the next step I will deal with percentages: more precisely, I calculate the percentage that each word contributes to the sample text, starting with the most frequent words (they are arranged in descending order of frequency), and I also calculate the cumulative percentage. Because of my computer's limited capacity it was not possible to use the whole sample corpus for this step, so I used a reduced subset to make the idea (and the corresponding algorithm) clear:
library(tm)
a<-sample_text[1:900]
b<-sample_text[21300:22200]
c<-sample_text[38300:39200]
sample_text2<-c(a,b,c)
sample_text2<-iconv(sample_text2,"latin1","ASCII",sub="")
sample_text2<-VectorSource(sample_text2)
sample_text2<-Corpus(sample_text2)
sample_text2<-tm_map(sample_text2,tolower)
sample_text2<-tm_map(sample_text2, removePunctuation)
sample_text2<-tm_map(sample_text2, removeNumbers)
sample_text2<-tm_map(sample_text2,stripWhitespace)
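## note: the next line stores the PlainTextDocument version of this reduced corpus in sample_text, overwriting the larger corpus from above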
sample_text<-tm_map(sample_text2,PlainTextDocument)
library(tm)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dtm<-DocumentTermMatrix(sample_text)
dtm<-as.matrix(dtm)
freq<-colSums(dtm)
freq<-sort(freq,decreasing = TRUE)
cum<-as.data.frame(freq)
cum$perc<-freq/sum(freq)*100
cum$cum[1]<-cum$perc[1]
for(i in 2:nrow(cum)){cum$cum[i]<-cum$perc[i]+cum$cum[i-1]}
head(cum,25)
## freq perc cum
## the 2581 5.6621986 5.662199
## and 1365 2.9945374 8.656736
## you 744 1.6321874 10.288924
## for 631 1.3842880 11.673212
## that 604 1.3250554 12.998267
## with 407 0.8928767 13.891144
## this 388 0.8511945 14.742338
## have 342 0.7502797 15.492618
## was 325 0.7129851 16.205603
## are 320 0.7020161 16.907619
## but 316 0.6932409 17.600860
## all 261 0.5725819 18.173442
## not 249 0.5462563 18.719698
## just 243 0.5330935 19.252792
## they 217 0.4760547 19.728846
## what 214 0.4694733 20.198320
## your 214 0.4694733 20.667793
## from 212 0.4650857 21.132878
## its 211 0.4628919 21.595770
## out 207 0.4541167 22.049887
## one 187 0.4102407 22.460128
## will 185 0.4058531 22.865981
## like 184 0.4036593 23.269640
## about 176 0.3861089 23.655749
## can 174 0.3817213 24.037470
nwords<-nrow(cum)
cum1<-mutate(cum,ind=1:nwords)
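As an aside, the explicit loop used above to build the cumulative percentage can be replaced by R's cumsum(), which produces exactly the same column:
cum$cum<-cumsum(cum$perc)   # equivalent to the for-loop above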
Only about 240 different words are needed to cover 50% of this text corpus:
cum1[237:242,]
## freq perc cum ind
## 237 24 0.05265121 49.93528 237
## 238 24 0.05265121 49.98793 238
## 239 24 0.05265121 50.04059 239
## 240 24 0.05265121 50.09324 240
## 241 24 0.05265121 50.14589 241
## 242 24 0.05265121 50.19854 242
We can also visualize the most frequent words in a barplot:
barplot(cum[1:25,2], names.arg=rownames(cum[1:25,]),las=3)
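Using the index column of cum1, the cumulative coverage can also be drawn as a curve, which makes the 50% point easy to read off. This is a small sketch in base graphics:
plot(cum1$ind,cum1$cum,type="l",
     xlab="number of distinct words",ylab="cumulative coverage in %")
abline(h=50,lty=2)   # the 50% line is crossed at roughly 240 words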
Now we build n-grams in a systematic way using the RWeka library. The idea is that by exploring which word combinations occur together frequently, we obtain a basis for the prediction algorithm to be developed.
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.5
unigram<-NGramTokenizer(sample_text2,Weka_control(min=1,max=1))
unigram1<-as.data.frame(table(unigram))
uni<-arrange(unigram1,desc(Freq))
uni$perc<-uni$Freq/sum(uni$Freq)*100
uni$cum[1]<-uni$perc[1]
for(i in 2:nrow(uni)){uni$cum[i]<-uni$perc[i]+uni$cum[i-1]}
uni[1:30,]
## unigram Freq perc cum
## 1 the 2581 4.4048127 4.404813
## 2 to 1668 2.8466593 7.251472
## 3 and 1365 2.3295503 9.581022
## 4 a 1351 2.3056575 11.886680
## 5 i 1259 2.1486475 14.035327
## 6 of 1111 1.8960662 15.931393
## 7 in 895 1.5274341 17.458828
## 8 you 744 1.2697329 18.728560
## 9 is 735 1.2543732 19.982934
## 10 it 687 1.1724550 21.155389
## 11 for 631 1.0768837 22.232272
## 12 that 604 1.0308047 23.263077
## 13 my 511 0.8720881 24.135165
## 14 on 485 0.8277157 24.962881
## 15 with 407 0.6945985 25.657479
## 16 this 388 0.6621725 26.319652
## 17 have 342 0.5836675 26.903319
## 18 at 326 0.5563615 27.459681
## 19 was 325 0.5546548 28.014336
## 20 are 320 0.5461217 28.560457
## 21 but 316 0.5392952 29.099753
## 22 be 315 0.5375885 29.637341
## 23 as 270 0.4607902 30.098131
## 24 we 270 0.4607902 30.558921
## 25 all 261 0.4454305 31.004352
## 26 me 256 0.4368973 31.441249
## 27 not 249 0.4249509 31.866200
## 28 so 248 0.4232443 32.289444
## 29 just 243 0.4147112 32.704156
## 30 they 217 0.3703388 33.074494
What caught my eye was that this output differs from the one created via the DocumentTermMatrix. It appears that the DocumentTermMatrix excludes words with fewer than three letters (its default minimum word length is 3), whereas the NGramTokenizer keeps them.
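If the short words should be kept in the DocumentTermMatrix as well, the wordLengths option can be relaxed via the control argument (tm's default minimum word length is 3); dtm_all is just a name chosen for this sketch:
dtm_all<-DocumentTermMatrix(sample_text,control=list(wordLengths=c(1,Inf)))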
bigram<-NGramTokenizer(sample_text2,Weka_control(min=2,max=2))
bigram1<-as.data.frame(table(bigram))
bi<-arrange(bigram1,desc(Freq))
bi[1:20,]
## bigram Freq
## 1 in the 238
## 2 of the 234
## 3 on the 121
## 4 to the 117
## 5 for the 112
## 6 to be 100
## 7 i am 75
## 8 and the 73
## 9 in a 73
## 10 i have 71
## 11 with the 71
## 12 at the 70
## 13 i was 69
## 14 it is 67
## 15 and i 63
## 16 if you 62
## 17 for a 61
## 18 is a 57
## 19 going to 55
## 20 from the 53
trigram<-NGramTokenizer(sample_text2,Weka_control(min=3,max=3))
trigram1<-as.data.frame(table(trigram))
tri<-arrange(trigram1,desc(Freq))
tri[1:20,]
## trigram Freq
## 1 one of the 21
## 2 thanks for the 19
## 3 a lot of 16
## 4 is going to 15
## 5 out of the 15
## 6 the end of 12
## 7 going to be 11
## 8 i have a 11
## 9 i need to 11
## 10 in the world 11
## 11 i am not 10
## 12 the fact that 10
## 13 to be a 10
## 14 to go to 10
## 15 all of the 9
## 16 be able to 9
## 17 cant wait to 9
## 18 i have to 9
## 19 some of the 9
## 20 there is a 9
As the aim of this project is to build a Shiny app which “predicts” words based on the beginning of a sentence entered by the user, I will go deeper into the subject of NLP and think about which algorithms are suitable for the project's task. The general idea will be to work with conditional probabilities: depending on which word sequence (“n-gram”) has been entered, the probability of the next word to come will be calculated.
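To make the idea of conditional probabilities a bit more concrete, here is a minimal sketch (my own illustration, not the final app) that uses the bigram table bi from above to suggest the most likely next words after a given word, based on relative frequencies; predict_next is a hypothetical helper name:
predict_next<-function(word,bigrams=bi,n=3){
  # split each bigram into its first and second word
  parts<-strsplit(as.character(bigrams$bigram)," ")
  first<-sapply(parts,`[`,1)
  second<-sapply(parts,`[`,2)
  # keep only the bigrams starting with the given word
  idx<-which(first==tolower(word))
  if(length(idx)==0) return(NULL)
  # estimate P(next word | word) as the relative frequency among these bigrams
  probs<-bigrams$Freq[idx]/sum(bigrams$Freq[idx])
  ord<-order(probs,decreasing=TRUE)[1:min(n,length(idx))]
  data.frame(next_word=second[idx][ord],prob=probs[ord])
}
predict_next("going")   # e.g. ranks "to" highly, compare the bigram "going to" above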