This interim report provides an update on progress towards building a predictive text model.
I find it easiest to use the shell to get a sense of the size of the three text datasets: the wc command reports the number of lines, words and characters in each file.
system("wc final/en_US/*", intern=T)
## [1] " 899288 37334114 210160014 final/en_US/en_US.blogs.txt"
## [2] " 1010242 34365936 205811889 final/en_US/en_US.news.txt"
## [3] " 2360148 30359804 167105338 final/en_US/en_US.twitter.txt"
## [4] " 4269678 102059854 583077241 total"
I chose to look at what I expected to be the “dirtiest” data first: the Twitter corpus. My thought was that this dataset would require the most work to clean. A number of specialized processing steps were implemented to do this; the sketch below illustrates the kind of cleaning involved.
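The exact transformations are not reproduced in this report; the following is a minimal sketch of this kind of cleaning, assuming the tm package. The helper name clean_corpus and the specific transformations are illustrative choices, not necessarily the exact pipeline used here.

library(tm)
# Illustrative cleaning helper (assumed, not the exact steps used here):
# lower-case, strip URLs, drop numbers and punctuation, collapse whitespace.
clean_corpus <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, content_transformer(function(x) gsub("http\\S+", "", x)))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  corp
}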
The dataset is large: 2360148 tweets. While the entire file can be read into memory, it is too big for term-document matrix (TDM) creation and the operations on it. So, 500,000 tweets were sampled and used to explore term usage, vocabulary size and cumulative frequency distributions.
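A minimal sketch of the sampling and TDM construction under the same assumptions (object names such as tweets.sample and tdm, and the seed, are illustrative):

# Read the full Twitter file and take a 500,000-tweet random sample.
tweets <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
set.seed(1234)
tweets.sample <- sample(tweets, 5e5)
# Clean the sample (clean_corpus sketched above) and build a unigram TDM.
corpus <- VCorpus(VectorSource(tweets.sample))
corpus <- clean_corpus(corpus)
tdm <- TermDocumentMatrix(corpus)
# A bigram TDM can be built the same way with a bigram tokenizer, e.g.
# RWeka::NGramTokenizer passed via control = list(tokenize = ...).
# The fq and fq.bi objects used below are presumably frequency-sorted
# unigram and bigram term-frequency tables derived from such matrices.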
# Cumulative coverage: fraction of all tokens in the corpus accounted for
# by the n most frequent terms (frequency-sorted).
plot(cumsum(fq.bi)/sum(fq.bi), type="l", xlab="Freq Sorted Words", ylab="Percent of Words covered in corpus", ylim=c(0,1))
lines(cumsum(fq)/sum(fq), type="l", col="blue")
legend("bottomright", lty=c(1,1), col=c("black","blue"), legend=c("Bigrams","Unigrams"), bty="n")
# Vocabulary growth: number of unique terms seen as documents are added,
# on linear axes (left) and log-log axes (right); bigrams in black,
# unigrams in blue, matching the legends.
par(mfcol=c(1,2))
plot(l.bi, type="l", xlab="Number of Documents (x1000)", ylab="Number of Unique Terms", main="Vocab Size")
lines(l, type="l", col="blue")
legend("topleft", lty=c(1,1), col=c("black","blue"), legend=c("Bigrams","Unigrams"), bty="n")
plot(log10(seq(1,length(l.bi))), log10(l.bi), type="l", xlab="log10 Number of Documents (x1000)", ylab="log10 Number of Unique Terms", main="Vocab Size", ylim=c(3.5,6.5))
lines(y=log10(l), x=log10(seq(1,length(l))), type="l", col="blue")
legend("bottomright", lty=c(1,1), col=c("black","blue"), legend=c("Bigrams","Unigrams"), bty="n")
To visualize the most frequent words in the corpus we can make use of a wordcloud to qualitatively assess any missteps in the processing. A histogram of word frequencies shows that the term-document matrices are sparse (a large number of zeros in the matrix).
library(wordcloud)
library(RColorBrewer)   # brewer.pal() supplies the color palette
# Word clouds of the most frequent unigrams (left) and bigrams (right).
par(mfcol=c(1,2))
wordcloud(rownames(fq), fq, rot.per=0.25, max.words=150, random.order=F, colors=brewer.pal(9, "BuGn")[-(1:4)])
wordcloud(rownames(fq.bi), fq.bi, rot.per=0.25, max.words=75, random.order=F, colors=brewer.pal(9, "BuGn")[-(1:4)])
# Histograms of square-root-transformed term frequencies: most terms occur
# only a handful of times, so the term-document matrices are sparse.
hist(sqrt(fq), ylim=c(0,5000), breaks=1000, main="Unigram Term Frequency")
hist(sqrt(fq.bi), ylim=c(0,5000), breaks=1000, main="Bigram Term Frequency")
We see with unigrams that the complexity of the corpus is quite low. The number of terms needed to cover 50% and 90% of all words in the corpus is 234 and 8798, respectively. Furthermore, we see that the number of unique terms gained as we add more tweets to the corpus diminishes quickly.
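These coverage counts follow directly from the cumulative distribution plotted earlier. Assuming fq is the frequency-sorted unigram table used above, they can be read off as follows (the helper name coverage is illustrative; the same call with fq.bi gives the bigram figures):

# Number of frequency-sorted terms needed to cover a given fraction of all
# tokens in the corpus (helper name is illustrative).
coverage <- function(freqs, p) min(which(cumsum(freqs)/sum(freqs) >= p))
coverage(fq, 0.5)   # terms needed for 50% coverage
coverage(fq, 0.9)   # terms needed for 90% coverage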
When we examine bigrams, we see that many more terms are needed to capture 50% and 90% of the corpus. Furthermore, the bigram vocabulary is much larger and not as well covered by the sampled tweets.