In this brief report we'll explore the basic features of the three document datasets (Twitter posts, blog posts, and news articles). We'll then identify some of the most commonly used words and sequences of words to inform how we should start building our model. The next-word prediction model will be built on conditional probabilities: given the previous two or three words a user has typed, we suggest the word that has most commonly followed that sequence in the corpus.
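As a rough illustration only, here is a minimal sketch of such a lookup, assuming a hypothetical trigram frequency table named trigrams (columns word1, word2, word3, Freq) built from the tokenized corpus; the function name predict.next is a placeholder and not part of the final model.
# Sketch: suggest the next word(s) given the user's last two words, using a
# hypothetical trigram frequency table (columns word1, word2, word3, Freq).
predict.next <- function(w1, w2, trigrams, n = 3) {
  # keep trigrams whose first two words match the last two words typed
  matches <- trigrams[trigrams$word1 == w1 & trigrams$word2 == w2, ]
  # return the most frequently observed third words as suggestions
  head(matches$word3[order(matches$Freq, decreasing = TRUE)], n)
}
# e.g. predict.next("thanks", "for", trigrams)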
(readLines() warned that a few lines of en_US.twitter.txt contain embedded nul characters and that en_US.news.txt has an incomplete final line.)
# Count the number of entries (lines) in each source and compare them
line.count <- data.frame(twitter = length(df.twitter), blogs = length(df.blogs), news = length(df.news))
barplot(as.matrix(line.count), names.arg = c("Twitter", "Blogs", "News"), main = "Number of Lines")
We see that Twitter has the most entries, followed by Blogs, with News having the fewest. This is expected given the typical length of each kind of posting.
# Approximate the number of words per entry by counting runs of spaces and adding one
words.twitter <- sapply(gregexpr(" +", df.twitter), length) + 1
summary(words.twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 7.00 12.00 12.87 18.00 47.00
hist(words.twitter,col = "blue",xlim=c(0,35),breaks=35,main="Histogram of Words for Twitter",xlab="Number of Words")
We see that Twitter posts almost always contain fewer than 30 words. This is to be expected, since Twitter's 140-character limit means only so many words can fit. Half of the posts contain 12 words or fewer.
words.blogs<-sapply(gregexpr(" +", df.blogs), length) + 1
summary(words.blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 9.00 28.00 41.53 59.00 6630.00
hist(words.blogs,col = "blue",xlim=c(0,300),breaks=600,main="Histogram of Words for Blogs",xlab="Number of Words")
Blogs show much more right skew than the other sources, as seen from the mean being much higher than the median. Still, most blog posts are brief: three-quarters contain fewer than 60 words. However, a small number of blog posts are very long (I guess these are the people who are rambling on and on and on…)
words.news<-sapply(gregexpr(" +", df.news), length) + 1
summary(words.news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 19.00 31.00 34.23 45.00 1031.00
hist(words.news,col = "blue",xlim=c(0,200),breaks=200,main="Histogram of Words for News",xlab="Number of Words")
News articles also show a bit of right skew, partly because the distribution is truncated on the left near one word. Otherwise it is fairly symmetrical and centered around 30 to 35 words, and most articles contain fewer than 50 words. Overall, blogs and news articles seem to be about the same length.
mycorpus<-corpus(my.doc, ignoredFeatures=c("will", stopwords("english")))
## Warning in corpus.character(my.doc, ignoredFeatures = c("will",
## stopwords("english"))): Argument ignoredFeatures not used.
summary(mycorpus)
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences
## text1 397831 7059848 509457
## text2 284774 8776802 400734
## text3 53030 619403 28070
##
## Source: C:/Users/Wesley/Documents/Data Science Capstone/final/en_US/* on x86-64 by Wesley
## Created: Sat Mar 19 23:22:27 2016
## Notes:
mydfm<-dfm(mycorpus)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 555,420 feature types
## ... created a 3 x 555420 sparse dfm
## ... complete.
## Elapsed time: 43.7 seconds.
topfeatures(mydfm,20)
## the to and a i of in you is for
## 572945 381253 314949 309875 266964 257022 200678 160132 159520 153109
## that it on my with was this be have at
## 140070 132900 111079 107780 94210 82020 80649 80555 77741 72728
plot(mydfm, max.words = 80, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))
We see that Blogs and Twitter have many more types (distinct words) than News. This is not surprising, as people tend to invent informal "words" that would not typically appear in a formal article. Also, from the word cloud we see that some of the most common words in the English language dominate (for example, "the", "and", "a", and "to"). This is not surprising, as these connector words appear in all kinds of writing.
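The warning above shows that the ignoredFeatures argument was not applied when the corpus was built, which is part of why these stop words still dominate the counts. A minimal sketch of how they could be dropped instead, assuming the older quanteda API used here (newer versions use dfm_remove() or tokens_remove()):
# Sketch: rebuild the document-feature matrix without "will" and the common
# English stop words. The argument name follows the older quanteda API used
# above; this is an assumption, not output from the original analysis.
mydfm.nostop <- dfm(mycorpus, ignoredFeatures = c("will", stopwords("english")))
topfeatures(mydfm.nostop, 20)   # most frequent features after stop-word removal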
# Tokenize each source into unigrams, bigrams, and trigrams, dropping numbers, punctuation, and Twitter symbols
my.token <- tokenize(my.doc, removeNumbers = TRUE, removePunct = TRUE, ngrams = 1:3, removeTwitter = TRUE)
token.twitter<-arrange(as.data.frame(table(my.token[[1]])),desc(Freq))
names(token.twitter)<-c("Word","Freq")
barplot(token.twitter$Freq[1:10], names.arg = token.twitter$Word[1:10], col = "darkgreen", main = "Twitter Most Common Words")
token.blogs<-arrange(as.data.frame(table(my.token[[2]])),desc(Freq))
names(token.blogs)<-c("Word","Freq")
barplot(token.blogs$Freq[1:10], names.arg = token.blogs$Word[1:10], col = "darkgreen", main = "Blogs Most Common Words")
token.news<-arrange(as.data.frame(table(my.token[[3]])),desc(Freq))
names(token.news)<-c("Word","Freq")
barplot(token.news$Freq[1:10], names.arg = token.news$Word[1:10], col = "darkgreen", main = "News Most Common Words")
Overall this gives a good summary of the data, but additional cleaning may be needed to build a better prediction algorithm in the future.
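As one example of the kind of cleaning that may help, here is a small sketch (an assumption about future work, not something run above) that strips URLs and non-ASCII characters from the raw lines before tokenizing:
# Sketch of a possible cleaning step (hypothetical, not part of this analysis):
clean.lines <- function(x) {
  x <- gsub("http[^[:space:]]+", " ", x)                  # drop URLs
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")  # drop non-ASCII characters
  gsub("[[:space:]]+", " ", x)                            # collapse repeated whitespace
}
# e.g. df.twitter.clean <- clean.lines(df.twitter)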