Text Counts

The number of texts available from each source is as follows: blogs had 899288 entries, news had 77259 entries, and Twitter had 2360148 entries.

Word Counts

In a 25% sample of the available texts, there were 146603 distinct words, which together account for a total of 7140031 word occurrences in the sample.
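The difference between distinct words and total word occurrences can be illustrated with a minimal Python sketch. This is not the code behind the figures above: the tokenizer, the function names `tokenize` and `word_stats`, and the two-sentence sample are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def word_stats(texts, max_count=20):
    """Return (distinct words, total occurrences, histogram of per-word counts).

    The histogram maps k to the number of words that occur exactly k times,
    keeping only k <= max_count.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    distinct = len(counts)
    total = sum(counts.values())
    histogram = Counter(c for c in counts.values() if c <= max_count)
    return distinct, total, dict(sorted(histogram.items()))

# Hypothetical toy input standing in for the sampled texts
sample = ["the cat sat on the mat", "the dog sat"]
distinct, total, histogram = word_stats(sample)
print(distinct, total, histogram)
```

On the toy input, six distinct words account for nine occurrences, and four of the six words occur only once.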

The distribution of word counts (constrained to counts of 20 or fewer) looks as follows:

N-Grams

Another way to look at the data is to see how many n-grams exist in the texts. An n-gram is a sequence of n consecutive words: a unigram is 1 word, a bigram is 2, a trigram is 3, and so forth. Given the large number of n-grams that occur only once, some trimming will be needed.
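As a rough illustration of n-gram extraction and of how quickly single-count n-grams pile up, here is a small Python sketch. Whitespace tokenization, the helper names `ngrams`, `ngram_counts`, and `singleton_share`, and the sample sentences are assumptions for the example, not the method used to produce this report.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over each run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def ngram_counts(texts, n):
    """Count every n-gram across the texts (assumed whitespace tokenization)."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts

def singleton_share(counts):
    """Fraction of distinct n-grams that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Hypothetical toy input
sample = ["the cat sat on the mat", "the cat sat on the sofa"]
for n in (1, 2, 3):
    counts = ngram_counts(sample, n)
    print(n, len(counts), singleton_share(counts))
```

Even on two sentences, the share of n-grams seen only once grows as n increases, which is the pattern that motivates trimming.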

Next Steps

  1. Reduce the n-gram model so that it runs more quickly: keep only words that occur more than once and remove stop words.
  2. Attempt to include higher-order n-grams to add greater context.
  3. Implement a backoff model to go to lower n-grams for prediction.
  4. Attempt a smoothing algorithm to allow for unseen n-grams.
  5. Overcome the challenge of using the tidytext package with term/feature matrices.
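Steps 1 and 3 could be prototyped along the following lines. This is a sketch under stated assumptions, not a planned implementation: it uses Python rather than tidytext, whitespace tokenization, a plain "longest matching context wins" backoff with no discounting or smoothing, and the hypothetical names `build_model`, `predict`, and `min_count`.

```python
from collections import Counter, defaultdict

def build_model(texts, max_n=3, min_count=2):
    """Map each context tuple to a Counter of following words.

    N-grams seen fewer than min_count times are trimmed (assumed threshold).
    """
    raw = Counter()
    for text in texts:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                raw[tuple(tokens[i:i + n])] += 1
    model = defaultdict(Counter)
    for gram, count in raw.items():
        if count >= min_count:
            model[gram[:-1]][gram[-1]] = count
    return model

def predict(model, words, max_n=3):
    """Try the longest context first, then back off to shorter contexts."""
    tokens = [w.lower() for w in words]
    for n in range(max_n, 0, -1):
        # For n == 1 the context is empty, i.e. the unigram table
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        followers = model.get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None

# Hypothetical toy corpus
sample = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]
model = build_model(sample)
print(predict(model, ["sat", "on"]))  # trigram context is available
print(predict(model, ["a", "dog"]))   # backs off to the unigram table
```

In the second call the trigram and bigram continuations were each seen only once and were trimmed, so the lookup falls through to the most frequent single word. A smoothing method (step 4) would replace this hard fallback with probability estimates for unseen n-grams.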