This interim report provides an update on progress toward building a predictive text model.

Line, word, and character counts

I find it easiest to use bash to understand the size of the three text datasets. Bash’s wc command outputs the number of lines, words, and bytes for each file.

system("wc final/en_US/*", intern=T)
## [1] "   899288  37334114 210160014 final/en_US/en_US.blogs.txt"  
## [2] "  1010242  34365936 205811889 final/en_US/en_US.news.txt"   
## [3] "  2360148  30359804 167105338 final/en_US/en_US.twitter.txt"
## [4] "  4269678 102059854 583077241 total"

I chose to look at what I expected to be the “dirtiest” data first: the Twitter corpus. My thinking was that this dataset would require the most work to clean. Several specialized processing steps were implemented or planned to handle it.

  1. Everything was transformed to lowercase.
  2. Spelling was not checked, nor were attempts made to correct misspelled words.
  3. Stopwords were not removed (the goal is to predict the next word, not to extract meaning).
  4. Foreign-language detection needs to be implemented and can likely be done by looking for accented characters and/or building a foreign-language dictionary. However, foreign words appear to be infrequent, so I expect them to fall out of the model in the end.
  5. Special characters need to be removed, e.g., emoticons such as ^_^ or :-) and sentence punctuation, but we do not want to remove apostrophes, because “dont” is not a word, and even if it were, it would not mean the same thing as “don’t.”
  6. Profanity: my view is that it should not be suggested by the predictor but must remain in the model. I cannot stop a user from entering profanity, so it still provides predictive value.
  7. Tokenization: using the tm and NLP packages, uni-, bi-, and trigrams were tokenized from the Twitter corpus into term-document matrices (a minimal sketch follows this list).
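
To make steps 1, 4, 5, and 7 concrete, below is a minimal sketch of the cleaning and tokenization pipeline, assuming the raw tweets are available as a character vector named tweets; the object names and regular expressions are illustrative simplifications, not the exact code used.

library(tm)
library(NLP)

# Step 4 (rough proxy): drop tweets containing non-ASCII characters
tweets <- tweets[!grepl("[^\x01-\x7F]", tweets)]

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))                               # step 1
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^a-z' ]", " ", x)))  # step 5: keep apostrophes
corpus <- tm_map(corpus, stripWhitespace)

# Step 7: uni- and bigram term-document matrices (trigrams are analogous)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "), use.names = FALSE)

tdm.uni <- TermDocumentMatrix(corpus)
tdm.bi  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))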

Twitter Corpus Exploration

The dataset is large: 2,360,148 tweets. While the entire file can be read into memory, it is too big for TDM creation and operations, so 500,000 tweets were sampled and used to explore term usage, vocabulary size, and cumulative coverage distributions.
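
A minimal sketch of the sampling and of building the sorted frequency vectors used below, assuming the cleaning and tokenization sketch above; the seed and the use of slam::row_sums are my own illustrative choices.

set.seed(1234)                      # illustrative seed
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
tweets  <- sample(twitter, 5e5)     # 500,000 tweets
# ... rebuild tdm.uni and tdm.bi from 'tweets' as in the sketch above ...

library(slam)                       # row_sums() avoids densifying the sparse TDM
fq    <- as.matrix(sort(row_sums(tdm.uni), decreasing = TRUE))
fq.bi <- as.matrix(sort(row_sums(tdm.bi),  decreasing = TRUE))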

# fq and fq.bi are uni- and bigram term frequencies sorted in decreasing order
plot(cumsum(fq.bi)/sum(fq.bi), type="l", xlab="Freq Sorted Words", ylab="Percent of Words covered in corpus", ylim=c(0,1))
lines(cumsum(fq)/sum(fq), type="l", col="blue")
legend("bottomright", lty=c(1,1), col=c("black","blue"), legend=c("Bigrams","Unigrams"), bty="n")

par(mfcol=c(1,2))
# l and l.bi hold the cumulative count of unique uni- and bigram terms as tweets are added (one point per 1,000 tweets)
plot(l.bi, type="l", xlab="Number of Documents (x1000)", ylab="Number of Unique Terms", main="Vocab Size")
lines(l, type="l", col="blue")
legend("topleft", lty=c(1,1), col=c("black","blue"), legend=c("Bigrams","Unigrams"), bty="n")

plot(log10(seq(1,length(l.bi))),log10(l.bi),type="l", xlab="log10 Number of Documents (x1000)", ylab="log10 Number of Unique Terms", main="Vocab Size", col="blue", ylim=c(3.5,6.5))
lines(y=log10(l), x=log10(seq(1,length(l))),type="l")
legend("bottomright", lty=c(1,1), col=c("black","blue"), legend=c("Bigrams","Unigrams"), bty="n")

Most frequent words in the Corpus

To visualize the most frequent words in the corpus, we can use a wordcloud to qualitatively assess any missteps in the processing. A histogram of term frequencies shows that the term-document matrices are sparse (a large number of zeros in the matrix).

library(wordcloud)
library(RColorBrewer)  # for brewer.pal()
par(mfcol=c(1,2))
# Unigram cloud (top 150 terms) alongside the bigram cloud (top 75 terms)
wordcloud(rownames(fq), fq, rot.per = 0.25, max.words = 150, random.order = FALSE, colors = brewer.pal(9, "BuGn")[-(1:4)])

wordcloud(rownames(fq.bi), fq.bi, rot.per = 0.25, max.words = 75, random.order = FALSE, colors = brewer.pal(9, "BuGn")[-(1:4)])

# Square-root transform compresses the long right tail of the term-frequency distributions
hist(sqrt(fq), ylim=c(0,5000), breaks=1000, main="Unigram Term Frequency")
hist(sqrt(fq.bi), ylim=c(0,5000), breaks=1000, main="Bigram Term Frequency")

Conclusions

With unigrams we see that the complexity of the corpus is quite low: the numbers of ranked terms needed to cover 50% and 90% of all words in the corpus are 234 and 8798, respectively. Furthermore, the number of unique terms gained as more tweets are added to the corpus diminishes quickly.
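
The coverage figures quoted above can be computed directly from the sorted frequency vectors; a small sketch, reusing the fq and fq.bi objects from the sampling step:

coverage <- function(freq, p) min(which(cumsum(freq) / sum(freq) >= p))
coverage(fq, 0.5)     # 234 ranked unigrams cover 50% of tokens in this sample
coverage(fq, 0.9)     # 8798 cover 90%
coverage(fq.bi, 0.5)  # bigrams need far more terms for the same coverage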

When we examine bigrams, we see that it takes many more words to capture 50% and 90% of the corpus. Furthermore, the vocabulary of bigrams is much larger and not as well covered by the sampled tweets.