Intro

In this brief report we explore the basic features of the three document datasets (Twitter, Blogs, and News articles). We then identify some of the most commonly used words and sets of words to see how best to start building our model. The model for predicting the user's next word will be built on conditional probabilities: we will look at the previous two to three words entered by the user and suggest the word that has most commonly followed that sequence.
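
To make the prediction step concrete, here is a minimal, self-contained sketch of the kind of lookup the model will perform. The counts table and the predict.next helper are purely illustrative assumptions, not code used elsewhere in this report:

# Illustrative only: a toy "prefix -> next word" count table.
ngram.counts <- data.frame(prefix = c("thanks for the", "thanks for the", "thanks for the"),
                           next.word = c("follow", "rt", "great"),
                           count = c(120, 95, 12),
                           stringsAsFactors = FALSE)
# Suggest the word most often observed after the given prefix.
predict.next <- function(prefix, counts) {
  candidates <- counts[counts$prefix == prefix, ]
  candidates$next.word[which.max(candidates$count)]
}
predict.next("thanks for the", ngram.counts)   # returns "follow"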

Loading the Data
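
The chunk that loads the packages and the raw text files was not echoed in the knit output. A likely sketch, reconstructed from the startup messages and warnings below, is shown here; the blogs file name is inferred from the other two and should be treated as an assumption:

library(tm)        # also attaches NLP
library(quanteda)
library(dplyr)

df.twitter <- readLines(con = "en_US.twitter.txt")
df.blogs   <- readLines(con = "en_US.blogs.txt")   # blogs file name is assumed
df.news    <- readLines(con = "en_US.news.txt")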

## Loading required package: NLP
## Warning: package 'quanteda' was built under R version 3.2.4
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:stats':
## 
##     df
## The following object is masked from 'package:base':
## 
##     sample
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning in readLines(con = "en_US.twitter.txt"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(con = "en_US.twitter.txt"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(con = "en_US.twitter.txt"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(con = "en_US.twitter.txt"): line 1759032 appears to
## contain an embedded nul
## Warning in readLines(con = "en_US.news.txt"): incomplete final line found
## on 'en_US.news.txt'

Word and Line Count Analysis

line.count<-data.frame(twitter=length(df.twitter),blogs=length(df.blogs),news=length(df.news))
barplot(as.matrix(line.count),names.arg =c("Twitter","Blogs","News"),main="Number of Lines")

We see that Twitter has the most entries, followed by Blogs, with News having the fewest. This is expected given the typical length of each kind of posting.

words.twitter<-sapply(gregexpr(" +", df.twitter), length) + 1
summary(words.twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    7.00   12.00   12.87   18.00   47.00
hist(words.twitter,col = "blue",xlim=c(0,35),breaks=35,main="Histogram of Words for Twitter",xlab="Number of Words")

We see that Twitter posts almost always contain fewer than 30 words. This is to be expected, as Twitter has a 140-character limit, so only so many words can fit. Half of the posts contain 12 words or fewer.

words.blogs<-sapply(gregexpr(" +", df.blogs), length) + 1
summary(words.blogs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    9.00   28.00   41.53   59.00 6630.00
hist(words.blogs,col = "blue",xlim=c(0,300),breaks=600,main="Histogram of Words for Blogs",xlab="Number of Words")

Blogs show much more right skew than the other sources, as can be seen from the fact that the mean is much higher than the median. Still, most blog posts are brief and contain fewer than 60 words. However, a small number of blog posts are very long (presumably the people who are rambling on and on and on…).

words.news<-sapply(gregexpr(" +", df.news), length) + 1
summary(words.news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   19.00   31.00   34.23   45.00 1031.00
hist(words.news,col = "blue",xlim=c(0,200),breaks=200,main="Histogram of Words for News",xlab="Number of Words")

News articles also show a bit of right skew, since word counts are bounded below (an article cannot have fewer than one word) while the right tail is unbounded. Otherwise the distribution is centered around 30 to 35 words and fairly symmetrical, and most articles contain fewer than 50 words. Overall, Blogs and News articles seem to be about the same length.

Start of Text Analysis
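
The object my.doc used below is not defined in the echoed code. One plausible construction, given that the corpus summary reports three documents, is sketched here; it is an assumption rather than the author's actual code, and the lines may also have been sampled first:

# Assumed construction (not echoed in the report): one document per source.
my.doc <- c(paste(df.twitter, collapse = " "),
            paste(df.blogs, collapse = " "),
            paste(df.news, collapse = " "))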

mycorpus<-corpus(my.doc, ignoredFeatures=c("will", stopwords("english")))
## Warning in corpus.character(my.doc, ignoredFeatures = c("will",
## stopwords("english"))): Argument ignoredFeatures not used.
summary(mycorpus)
## Corpus consisting of 3 documents.
## 
##   Text  Types  Tokens Sentences
##  text1 397831 7059848    509457
##  text2 284774 8776802    400734
##  text3  53030  619403     28070
## 
## Source:  C:/Users/Wesley/Documents/Data Science Capstone/final/en_US/* on x86-64 by Wesley
## Created: Sat Mar 19 23:22:27 2016
## Notes:
mydfm<-dfm(mycorpus)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 555,420 feature types
##    ... created a 3 x 555420 sparse dfm
##    ... complete. 
## Elapsed time: 43.7 seconds.
topfeatures(mydfm,20)
##    the     to    and      a      i     of     in    you     is    for 
## 572945 381253 314949 309875 266964 257022 200678 160132 159520 153109 
##   that     it     on     my   with    was   this     be   have     at 
## 140070 132900 111079 107780  94210  82020  80649  80555  77741  72728
plot(mydfm, max.words = 80, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))

We see that Blogs and Twitter have many more types (unique tokens) than News. This is not surprising, as people tend to make up informal “words” that would not typically appear in a formal article. Also, from the word cloud we see that some of the most common words in the English language appear the most (for example, “the”, “and”, “a”, and “to”). This is not surprising, as these connector words are used in all kinds of writing.
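
Note, however, that the warning above shows the stop words were never actually removed: ignoredFeatures is not an argument of corpus(), which is why words like “the” and “to” still dominate topfeatures(). In this version of quanteda the removal belongs to the dfm step instead; a sketch (not re-run here):

# Sketch: request stop word removal when building the document-feature matrix.
mydfm.clean <- dfm(mycorpus, ignoredFeatures = c("will", stopwords("english")))
topfeatures(mydfm.clean, 20)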

my.token<-tokenize(my.doc, removeNumbers = TRUE, removePunct = TRUE, ngrams=1:3, removeTwitter = TRUE)

token.twitter<-arrange(as.data.frame(table(my.token[[1]])),desc(Freq))
names(token.twitter)<-c("Word","Freq")
barplot(token.twitter$Freq[1:10],names.arg = token.twitter$Word[1:10],col="Dark Green",main="Twitter Most Common Words")

token.blogs<-arrange(as.data.frame(table(my.token[[2]])),desc(Freq))
names(token.blogs)<-c("Word","Freq")
barplot(token.blogs$Freq[1:10],names.arg = token.blogs$Word[1:10],col="Dark Green",main="Blogs Most Common Words")

token.news<-arrange(as.data.frame(table(my.token[[3]])),desc(Freq))
names(token.news)<-c("Word","Freq")
barplot(token.news$Freq[1:10],names.arg = token.news$Word[1:10],col="Dark Green",main="News Most Common Words")
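
Because the tokens above were built with ngrams = 1:3, they already contain the bigrams and trigrams needed for the prediction model described in the introduction. The sketch below shows one way the Twitter trigrams could be reshaped into a “previous two words, suggested next word” table; it assumes tokenize()'s default “_” concatenator and is illustrative rather than part of the analysis:

# Sketch: turn Twitter trigram counts into a prefix -> next word table.
tri.counts <- as.data.frame(table(my.token[[1]]), stringsAsFactors = FALSE)
names(tri.counts) <- c("ngram", "Freq")
tri.counts <- tri.counts[grepl("^[^_]+_[^_]+_[^_]+$", tri.counts$ngram), ]  # keep trigrams only
parts <- strsplit(tri.counts$ngram, "_")
tri.counts$prefix <- sapply(parts, function(p) paste(p[1], p[2]))
tri.counts$next.word <- sapply(parts, function(p) p[3])
tri.counts <- arrange(tri.counts, prefix, desc(Freq))
# For a given two-word prefix, the suggested next word is the first row for that prefix.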

Overall we now have a good summary of the data, but additional cleaning may be needed to build a better algorithm in the future.