n-Gram Predictive Model - Step 1: Exploratory Data Analysis

  1. Basic stats - total number of lines in each dataset
##    blogs  news  tweets
## 1 899288 77259 2360148
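Line counts like these can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the code that produced the table above; the helper names and the file path in the usage note are hypothetical.

```python
import io

def count_lines(lines):
    """Count the items in any iterable of lines (an open file, a list, ...)."""
    return sum(1 for _ in lines)

def count_file_lines(path):
    """Count lines in a text file, streaming it so large files fit in memory."""
    # Assumption: the corpora are UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_lines(handle)

# Small in-memory demonstration instead of the real corpora.
sample = io.StringIO("first line\nsecond line\nthird line\n")
print(count_lines(sample))
```

With the real data this would be called once per file, for example `count_file_lines("blogs.txt")`, where the file name is a placeholder for wherever the dataset is stored.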

Distribution of 1) most frequent words (wordmap) and 2) histograms of 1-gram, 2-gram and 3-gram frequencies
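The frequency tables behind such histograms can be built by sliding a window of length n over the token stream. This is a rough sketch under stated assumptions: the tokenizer (lowercase, letters and apostrophes only) and the function names are illustrative choices, not the preprocessing used for the plots.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    # Assumption: a deliberately simple tokenizer; real cleaning may differ.
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by a tuple of words."""
    # zip over n shifted copies of the list yields each n-token window once.
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = tokenize("The cat sat on the mat and the cat slept")
for n in (1, 2, 3):
    # The most common n-grams are what a frequency histogram would plot.
    print(n, ngram_counts(tokens, n).most_common(3))
```

Keying the counts by tuples keeps 1-grams, 2-grams and 3-grams in the same shape, so the same plotting code can be reused for each.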

What percentage of unique words accounts for 90% of all word occurrences?

About 41% of unique words (roughly 7,000 out of 17,000) account for 90% of all word usage.
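A coverage figure of this kind can be obtained by sorting words from most to least frequent and accumulating their counts until the target share is reached. The sketch below assumes that approach; the function name and the toy counts are made up for illustration and are not the report's data.

```python
from collections import Counter

def words_for_coverage(counts, target=0.9):
    """Return how many of the most frequent words cover `target` of all occurrences."""
    total = sum(counts.values())
    running = 0
    # most_common() yields (word, frequency) pairs in descending frequency order.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target * total:
            return rank
    return len(counts)

# Toy frequency table (hypothetical numbers, 100 occurrences in total).
counts = Counter({"the": 50, "of": 30, "cat": 12, "mat": 5, "zebra": 3})
needed = words_for_coverage(counts, target=0.9)
share = needed / len(counts)
print(f"{needed} of {len(counts)} unique words ({share:.0%}) cover 90% of occurrences")
```

Applied to the figures quoted above, the same ratio is 7,000 / 17,000, which is approximately 0.41.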