This document outlines my strategy for implementing a next-word predictor “app” that takes a phrase as input and suggests a possible next word. Natural language processing is new to me, so I reviewed the topic on Wikipedia, Coursera, the R documentation, Google and YouTube. The corpus of text data for model development was provided on the course web site. It comes from three sources: blogs, news and twitter. See Appendix 1 for basic statistics.

I reviewed the input data files and, as a first experiment, searched them for answers to the quiz questions. I extracted the last 10 characters from each incomplete phrase provided in quiz 2 and compiled a list of all matching blog documents. Then I extracted the next word from each matching document and tabulated the next-word frequencies. Finally I printed the four most common next words for each question. This approach was partially successful. It identified words that matched one of the quiz answer choices in 4 out of 10 questions, but the most frequent word matched a choice in only 2 out of 10 questions, which involved the common tri-grams “case of beer” and “quite some time”. This was encouraging, especially since I only processed the blogs, and presumably the news and twitter data would improve the results. See Appendix 1 for a table of the search results.
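A minimal base R sketch of this kind of search is shown here. It is not the script I actually ran: the helper names `last_chars` and `next_word_counts`, their defaults, and the choice to look only at the first match in each document are all assumptions made for illustration.

```r
# Hypothetical helpers; names and defaults are illustrative assumptions.

# Last n characters of a phrase (the search key).
last_chars <- function(phrase, n = 10) {
  substring(phrase, max(1, nchar(phrase) - n + 1))
}

# Tabulate the words that directly follow `key` in `docs`.
next_word_counts <- function(docs, key, top_n = 4) {
  # keep documents that contain the key as a literal string
  hits <- docs[grepl(key, docs, fixed = TRUE)]
  if (length(hits) == 0) return(integer(0))
  # text after the first occurrence of the key in each document
  pos <- as.integer(regexpr(key, hits, fixed = TRUE))
  rest <- substring(hits, pos + nchar(key))
  # require whitespace after the key so "case of" does not match "case officer"
  rest <- trimws(rest[grepl("^[[:space:]]", rest)])
  # the next word is the leading run of letters and apostrophes
  nxt <- tolower(sub("[^[:alpha:]'].*$", "", rest))
  nxt <- nxt[nzchar(nxt)]
  if (length(nxt) == 0) return(integer(0))
  counts <- sort(table(nxt), decreasing = TRUE)
  head(setNames(as.integer(counts), names(counts)), top_n)
}

# Toy documents standing in for the blogs corpus
docs <- c("I bought a case of beer today",
          "this is a case of making do",
          "a case of beer is heavy")
key <- trimws(last_chars("and a case of"))
print(next_word_counts(docs, key))
```

Matching on a fixed character suffix is cheap, but it ignores word boundaries at the start of the key, which is one reason the search results in Appendix 1 contain fragments such as “ake me the”.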

The corpus was randomly divided into sets for development, training, validation and testing (1, 60, 20 and 20 percent, respectively). The development dataset is a subset of the training dataset.
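The split can be sketched as follows. This is an illustration under stated assumptions rather than the divideCorpus script itself: the function name `split_corpus`, the seed value and the rounding rules are hypothetical.

```r
# Hypothetical sketch of a seeded 60/20/20 split with a 1% development subset.
split_corpus <- function(docs, seed = 1234) {
  set.seed(seed)                       # seeded so the split is reproducible
  n <- length(docs)
  idx <- sample.int(n)                 # random permutation of document indexes
  n_train <- floor(0.60 * n)
  n_valid <- floor(0.20 * n)
  train <- idx[seq_len(n_train)]
  valid <- idx[n_train + seq_len(n_valid)]
  test <- setdiff(idx, c(train, valid))   # whatever is left (about 20%)
  # development set: 1% of the corpus, drawn from the training indexes
  n_dev <- min(n_train, ceiling(0.01 * n))
  dev <- train[seq_len(n_dev)]
  list(dev = docs[dev], train = docs[train],
       valid = docs[valid], test = docs[test])
}

parts <- split_corpus(paste("doc", 1:1000))
print(sapply(parts, length))
```

Because the development indexes are taken from the already shuffled training indexes, the development set is a random subset of the training set and never touches validation or test data.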

Using the development dataset, I printed frequency plots and tables (instead of histograms) for word types and ngram types. A “word type” is simply a word in the vocabulary: e.g. ‘the’ is the most common word. An ngram type is simply an ngram: e.g. ‘one of the’ was the most common tri-gram in the blogs and news corpora, whereas ‘thanks for the’ was the most common tri-gram in the twitter corpus. Word types and ngram types were ranked (sorted) by their frequency within each corpus (blogs, news or twitter). Each plot is followed by its corresponding table, so you can compare the shape of the curve in the plot to the numerical values in the table. They illustrate that the most common word types cover a very large percentage of the corpus: 1% of the word types covered up to 60% of the corpus. For bi-grams, 1% covered up to 25% of the corpus. Consistent with Zipf’s law, the plots illustrate that the “log base 10 of the cumulative percent of word types” is nearly proportional (approximately a straight line) to the percent of the corpus that they cover. “Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.” (http://en.wikipedia.org/wiki/Zipf%27s_law). When stop words are removed, the coverage is much lower and no longer follows Zipf’s law.
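The coverage figures can be computed along the following lines. This is a base R sketch, not the WordFrequency script; the function names `coverage_table` and `coverage_at` and the column names are assumptions chosen to mirror the columns reported in the appendix tables.

```r
# Hypothetical sketch: rank word types by frequency and compute cumulative coverage.
coverage_table <- function(words) {
  freq <- sort(table(words), decreasing = TRUE)
  counts <- as.integer(freq)
  cum_pct_vocab <- 100 * seq_along(counts) / length(counts)
  data.frame(
    word = names(freq),
    freq = counts,
    rank = seq_along(counts),
    log10_pct_vocab = log10(cum_pct_vocab),               # x axis of the plots
    cum_pct_vocab = cum_pct_vocab,                        # percent of word types
    cum_pct_corpus = 100 * cumsum(counts) / sum(counts),  # percent of corpus covered
    stringsAsFactors = FALSE
  )
}

# Percent of the corpus covered by the top `pct_vocab` percent of word types.
coverage_at <- function(cov, pct_vocab = 1) {
  k <- max(1, ceiling(nrow(cov) * pct_vocab / 100))
  cov$cum_pct_corpus[k]
}

words <- c(rep("the", 5), rep("of", 3), "cat", "dog")
cov <- coverage_table(words)
print(cov)
print(coverage_at(cov, 50))
```

The same function applies to ngrams if each ngram is passed as a single space-separated string.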

I would like to develop a set of features from the corpus. Each feature will have an associated model that outputs several next word predictions and probabilities. The set of predicted next words and probabilities will be given to a selection model (e.g. random forest) that selects the most likely next word. If I have time and resources, the set of features could include associations and frequencies for words and ngrams, word and ngram similarity, part of speech and sentiment.

Development strategy for data input and exploration:

  1. Download three text files for Blogs, News and Twitter.
  2. Use readLines with a byte (binary) read to create three corpus document lists.
  3. Create seeded random data extracts from each corpus: development 1% size, training 60%, validation 20%, testing 20%. Make the development extract a subset of the training extract; use it for development and exploration.
  4. Tokenize words: preserve English contractions by replacing the apostrophe with a caret (’ => ^), lowercase, remove numbers and punctuation, optionally remove stop words.
  5. Report on words: Corpus size, vocabulary size, word frequencies & ranks, vocabulary coverage of corpus, with and without stop words. See Appendix “A - WORDS”.
  6. Tokenize sentences and then ngrams within sentences. Ngram sizes: 2,3,4,5. With and without stop words.
  7. Report on ngrams: Corpus size, ngram vocabulary size, ngram frequencies & ranks, with and without stop words. See Appendix “B - NGRAMS”.
  8. R script names: divideCorpus, WordFrequency, NgramFrequency
  9. (http://rpubs.com/seibeldb/68490)
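The tokenizing steps in this list can be sketched in base R as follows. This is an illustration, not the WordFrequency or NgramFrequency scripts: the function names, the sentence-splitting rule (split on ., ! and ?) and the `stop_words` argument are assumptions.

```r
# Hypothetical sketch of word, sentence and ngram tokenization.

tokenize_words <- function(text, stop_words = character(0)) {
  text <- tolower(text)
  text <- gsub("\u2019", "'", text, fixed = TRUE)   # curly apostrophe to straight
  # preserve contractions: an apostrophe between two letters becomes '^'
  text <- gsub("(?<=[a-z])'(?=[a-z])", "^", text, perl = TRUE)
  # replace numbers, punctuation and anything else with a space
  text <- gsub("[^a-z^ ]+", " ", text)
  words <- strsplit(trimws(text), " +")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}

split_sentences <- function(doc) {
  s <- trimws(unlist(strsplit(doc, "[.!?]+")))
  s[nzchar(s)]
}

make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Ngrams are built within sentences, so none spans a sentence boundary.
doc_ngrams <- function(doc, n, stop_words = character(0)) {
  per_sentence <- lapply(split_sentences(doc), function(s) {
    make_ngrams(tokenize_words(s, stop_words), n)
  })
  as.character(unlist(per_sentence))
}

print(tokenize_words("I don't have 2 cats!"))
print(doc_ngrams("The cat sat. The dog ran!", 2))
```

Building ngrams sentence by sentence avoids counting word pairs such as “sat the” that only arise from joining the end of one sentence to the start of the next.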

Development strategy for features:

  1. Break documents into sentences. Twitter docs are <= 140 characters so assume sentence units.
  2. Create term-sentence frequencies and association indexes.
  3. Create ngrams from each sentence and create frequency and association indexes.
  4. Optimize frequency and association indexes: reduce sparsity by removing less significant items.
  5. Create function that outputs part of speech of a word.
  6. Create a function that outputs the positive/negative sentiment of a short phrase or word.
  7. Create a function to find the length of a phrase.
  8. R script names: tbd
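As one possible reading of steps 2 and 4, the sketch below builds a binary term-sentence matrix, drops rare terms to reduce sparsity, and derives co-occurrence counts as a simple association index. The function name `term_sentence_index`, the `min_sentences` threshold and the use of raw co-occurrence counts as the association measure are all assumptions; the final scripts may use a different measure.

```r
# Hypothetical sketch: term-sentence frequencies and a co-occurrence index.
# `sentence_tokens` is a list with one character vector of words per sentence.
term_sentence_index <- function(sentence_tokens, min_sentences = 2) {
  vocab <- sort(unique(unlist(sentence_tokens)))
  # one row per term, one column per sentence; 1 if the term occurs in the sentence
  tsm <- vapply(sentence_tokens,
                function(tok) as.integer(vocab %in% tok),
                integer(length(vocab)))
  tsm <- matrix(tsm, nrow = length(vocab), dimnames = list(vocab, NULL))
  # reduce sparsity: keep terms that occur in at least `min_sentences` sentences
  keep <- rowSums(tsm) >= min_sentences
  tsm <- tsm[keep, , drop = FALSE]
  # entry [a, b] counts the sentences that contain both a and b
  list(term_sentence = tsm, cooccurrence = tsm %*% t(tsm))
}

tokens <- list(c("the", "cat", "sat"),
               c("the", "dog", "sat"),
               c("a", "cat", "ran"))
idx <- term_sentence_index(tokens)
print(idx$cooccurrence)
```

A dense matrix like this is only practical for small samples; on the full training set a sparse representation would be needed.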

Development strategy for models:

  1. create a function to select a random length phrase from a document
  2. select random phrases from samples
  3. pop the last word off each selected phrase to represent the “answer”
  4. what remains is called the “query”.
  5. extract “lastWord” from the “query”
  6. use “lastWord” and get four most likely associated words from term-document
  7. use “lastWord” and get four most likely associated words from term-sentence
  8. extract “last2gram”, last3gram, last4gram from the “query”
  9. use “lastWord” and get four most likely matching 2-grams, pop off last words
  10. use “last2gram” and get four most likely matching 3-grams, pop off last words
  11. use “last3gram” and get four most likely matching 4-grams, pop off last words
  12. determine part of speech from the “lastWord”
  13. determine sentiment of the “query”
  14. determine length of “query”
  15. plug all potential last words and probabilities into a random forest model
  16. train using “answer”
  17. R script names: tbd
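The query/answer split and the ngram lookups (steps 3 to 11) could look roughly like this in base R. The function names, the representation of an ngram index as a named integer vector, and the use of relative frequency as the probability are assumptions for illustration. The random forest selection step is not shown.

```r
# Hypothetical sketch: split a phrase into query and answer, then look up
# the most likely next words from an ngram frequency index.

split_query_answer <- function(words) {
  n <- length(words)
  list(query = words[-n], answer = words[n])   # pop the last word off
}

# `ngram_freq`: named integer vector of counts for ngrams of one size,
# with space-separated ngrams as names. `context_size` is that size minus one.
candidates <- function(query, ngram_freq, context_size, top_n = 4) {
  empty <- data.frame(word = character(0), prob = numeric(0),
                      stringsAsFactors = FALSE)
  if (length(query) < context_size) return(empty)
  context <- paste(tail(query, context_size), collapse = " ")
  hits <- ngram_freq[startsWith(names(ngram_freq), paste0(context, " "))]
  if (length(hits) == 0) return(empty)
  total <- sum(hits)
  hits <- head(sort(hits, decreasing = TRUE), top_n)
  data.frame(word = sub("^.* ", "", names(hits)),   # pop off the last word
             prob = as.numeric(hits) / total,        # relative frequency
             stringsAsFactors = FALSE)
}

# Toy 3-gram index with made-up counts
trigrams <- c("case of beer" = 8L, "case of making" = 3L,
              "case of much" = 1L, "quite some time" = 5L)
qa <- split_query_answer(c("a", "case", "of", "beer"))
print(candidates(qa$query, trigrams, context_size = 2))
```

Calling `candidates` once per ngram size gives one set of predicted words and probabilities per feature, which is the form of input the selection model is meant to receive.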

Model training strategy:

  1. Prepare features from the training set
  2. Prepare training data from the training set: select a random phrase from each document, and repeat this n times per document, where n is a random integer from 1 to 3.
  3. R script names: tbd
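Step 2 can be sketched as follows, assuming each document has already been tokenized into a character vector of words. The function names `random_phrase` and `make_training_rows`, the seed, and the minimum phrase length of two words (one for the query, one for the answer) are assumptions.

```r
# Hypothetical sketch: draw random phrases and split them into query/answer rows.

random_phrase <- function(words, min_len = 2) {
  if (length(words) < min_len) return(NULL)   # too short to split
  # phrase length between min_len and the document length
  len <- min_len - 1 + sample.int(length(words) - min_len + 1, 1)
  start <- sample.int(length(words) - len + 1, 1)
  words[start:(start + len - 1)]
}

make_training_rows <- function(docs_tokens, seed = 42) {
  set.seed(seed)
  rows <- list()
  for (words in docs_tokens) {
    reps <- sample.int(3, 1)                  # n is random in 1:3
    for (r in seq_len(reps)) {
      p <- random_phrase(words)
      if (is.null(p)) next
      rows[[length(rows) + 1]] <- data.frame(
        query = paste(head(p, -1), collapse = " "),
        answer = tail(p, 1),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

docs_tokens <- list(c("a", "case", "of", "beer"),
                    c("quite", "some", "time"))
print(make_training_rows(docs_tokens))
```

The length is drawn with `sample.int` rather than `sample(a:b, 1)` because `sample` treats a single number as the upper end of a range starting at 1, which would give wrong lengths when only one length is possible.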

Shiny application strategy:

  1. Create a simple shiny user interface with a phrase input box, submit button and a box to display the next word prediction.
  2. The shiny server will use the model and indexes to predict the next word.
  3. Upload the model and indexes to the shiny server

Appendix 1 - Basic Statistics and Search Results

## 
## Basic Statistics
##                                     file lineCount meanLength maxLength totalNchar
## 1        ./final/en_US/en_US.twitter.txt   2360148   68.68045       140  162096031
## 2          ./final/en_US/en_US.blogs.txt    899288    229.987     40833  206824505
## 3           ./final/en_US/en_US.news.txt   1010242   201.1628     11384  203223159
## 4 ./final/en_US/sample.en_US.twitter.txt      2423   68.21461       140     165284
## 5   ./final/en_US/sample.en_US.blogs.txt       895   227.6615      1972     203757
## 6    ./final/en_US/sample.en_US.news.txt      1050   199.2533       884     209216
## 
## Search Results
##     cgramText cgramCount    nw1Text nw1Count  nw2Text nw2Count   nw3Text
## 1   a case of        251       beer        8   making        3      much
## 2  d mean the         31 difference        6    world        5     death
## 3  ake me the         25       ages        1      agi        1      best
## 4  ng but the        290      truth        8     best        7   kitchen
## 5  ate at the        141        end       15     same        9      time
## 6  d be on my         10       blog        1   family        1      feet
## 7  quite some        305       time      247    since       12     thing
## 8  his little       1488       girl       52      guy       44      blog
## 9  during the       5668        day      286     week      191    summer
## 10 ou must be        234       able       19 follower        8 wondering

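Appendix A - WORDS

Column key: xLeft is the log base 10 of xRight, xRight is the cumulative percent of word types (vocabulary), y is the cumulative percent of the corpus covered, freq is the word count and rank is the frequency rank. Each table shows selected rows only. The first three tables include stop words and the last three exclude them. Judging by the vocabulary (for example ‘rt’ and ‘lol’ in the last table of each group), the tables in each group appear to be ordered blogs, news, twitter.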

##           xLeft xRight      y  freq  rank
## zzzzz      2.00 100.00 100.00     1 29481
## paused     1.63  42.53  95.00     2 12537
## absolute   1.00  10.00  83.90    11  2948
## therefore  0.61   4.06  75.00    32  1196
## quite      0.00   1.00  60.89   127   295
## here      -0.44   0.36  49.95   431   107
## from      -1.01   0.10  34.26  1451    29
## you       -1.39   0.04  24.65  3026    12
## it        -1.47   0.03  22.86  3983    10
## is        -1.52   0.03  21.78  4225     9
## that      -1.57   0.03  20.63  4566     8
## in        -1.62   0.02  19.40  5998     7
## i         -1.69   0.02  17.78  7759     6
## of        -1.77   0.02  15.68  8755     5
## a         -1.87   0.01  13.31  8938     4
## to        -1.99   0.01  10.89 10582     3
## and       -2.17   0.01   8.02 10928     2
## the       -2.47   0.00   5.06 18712     1

##           xLeft xRight      y  freq  rank
## zyprexa    2.00 100.00 100.00     1 30628
## unveil     1.67  47.15  95.00     2 14441
## kelley     1.00  10.00  80.94    12  3063
## dangerous  0.78   6.00  75.00    21  1839
## keep       0.00   1.00  54.59   111   306
## night     -0.20   0.63  50.01   166   194
## who       -0.99   0.10  32.10  1089    31
## at        -1.31   0.05  25.32  2108    15
## on        -1.49   0.03  21.81  2621    10
## is        -1.53   0.03  21.02  2776     9
## that      -1.58   0.03  20.17  3289     8
## for       -1.64   0.02  19.17  3474     7
## in        -1.71   0.02  18.12  6638     6
## of        -1.79   0.02  16.10  7574     5
## and       -1.88   0.01  13.80  8549     4
## a         -2.01   0.01  11.20  8622     3
## to        -2.19   0.01   8.58  8875     2
## the       -2.49   0.00   5.88 19347     1

##           xLeft xRight      y freq  rank
## zz^s       2.00 100.00 100.00    1 25394
## valid      1.62  41.97  95.00    2 10658
## candidate  1.00  10.00  84.60   10  2539
## heat       0.57   3.72  75.00   33   945
## song       0.00   1.00  59.47  151   254
## first     -0.31   0.49  50.00  315   124
## we        -1.01   0.10  28.56 1505    25
## your      -1.15   0.07  24.79 1733    18
## of        -1.40   0.04  18.64 3553    10
## is        -1.45   0.04  17.45 3593     9
## in        -1.50   0.03  16.25 3772     8
## for       -1.56   0.03  14.99 3846     7
## and       -1.63   0.02  13.70 4507     6
## you       -1.71   0.02  12.19 5787     5
## a         -1.80   0.02  10.25 6083     4
## i         -1.93   0.01   8.22 7186     3
## to        -2.10   0.01   5.81 7987     2
## the       -2.40   0.00   3.14 9374     1

##           xLeft xRight      y freq  rank
## zzzzz      2.00 100.00 100.00    1 28939
## lone       1.87  74.05  95.00    1 21430
## michael^s  1.27  18.65  75.00    5  5396
## pen        1.00  10.00  63.52   10  2894
## heads      0.69   4.94  50.00   20  1431
## yesterday  0.02   1.05  24.98   63   305
## nature     0.00   1.00  24.30   65   289
## give      -1.00   0.10   6.60  206    29
## work      -1.46   0.03   3.38  379    10
## life      -1.51   0.03   3.13  380     9
## made      -1.56   0.03   2.87  401     8
## love      -1.62   0.02   2.61  456     7
## day       -1.68   0.02   2.30  474     6
## good      -1.76   0.02   1.99  489     5
## back      -1.86   0.01   1.66  522     4
## make      -1.98   0.01   1.31  542     3
## people    -2.16   0.01   0.95  581     2
## time      -2.46   0.00   0.57  850     1

##             xLeft xRight      y freq  rank
## zyprexa      2.00 100.00 100.00    1 30103
## lenovo       1.87  73.44  95.00    1 22107
## frustration  1.26  18.36  75.00    5  5527
## bombs        1.00  10.00  63.71   10  3010
## split        0.70   5.02  50.00   21  1511
## important    0.06   1.14  25.00   63   343
## force        0.00   1.00  23.28   68   301
## including   -1.00   0.10   5.78  201    30
## make        -1.48   0.03   2.62  312    10
## percent     -1.52   0.03   2.43  338     9
## back        -1.58   0.03   2.22  338     8
## school      -1.63   0.02   2.00  344     7
## city        -1.70   0.02   1.79  382     6
## years       -1.78   0.02   1.55  458     5
## state       -1.88   0.01   1.26  470     4
## people      -2.00   0.01   0.97  477     3
## time        -2.18   0.01   0.67  510     2
## year        -2.48   0.00   0.35  564     1

##          xLeft xRight      y freq  rank
## zz^s      2.00 100.00 100.00    1 24871
## melus     1.87  73.62  95.00    1 18311
## champion  1.19  15.42  75.00    5  3834
## ouch      1.00  10.00  67.96    9  2487
## david     0.54   3.45  50.00   25   859
## problem   0.00   1.00  31.52   71   249
## crazy    -0.23   0.60  24.98  104   148
## year     -1.00   0.10  10.29  273    25
## people   -1.40   0.04   6.10  526    10
## back     -1.44   0.04   5.70  579     9
## lol      -1.49   0.03   5.26  717     8
## great    -1.55   0.03   4.71  731     7
## today    -1.62   0.02   4.16  750     6
## time     -1.70   0.02   3.59  774     5
## rt       -1.79   0.02   3.00  884     4
## day      -1.92   0.01   2.32  930     3
## good     -2.09   0.01   1.61 1023     2
## love     -2.40   0.00   0.83 1093     1

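Appendix B - NGRAMS

The blocks below report ngram statistics for the blogs development set, one block per ngram size from 2 through 7, in that order. Column key: log10.col2 is the log base 10 of cum.%.vocab, cum.%.vocab is the cumulative percent of ngram types, cum.%.corpus is the cumulative percent of all ngrams covered, freq is the ngram count and rank is the frequency rank. The last block (7-grams) is cut off partway through its rank column.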

## [1] "Blogs-Dev Ngram types: 186995 , Total Ngrams: 360610 \n"  
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 25.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                  
##                    log10.col2 cum.%.vocab cum.%.corpus freq   rank
## żywiec archducal         2.00      100.00       100.00    1 186995
## through manuscript       1.96       90.36        95.00    1 168964
## house well               1.71       51.79        75.00    1  96842
## but how                  1.05       11.29        50.00    2  21107
## actions of               1.00       10.00        48.67    2  18700
## corn syrup               0.70        5.00        40.33    4   9350
## about his                0.00        1.00        25.09   15   1870
## unable to                0.00        0.99        25.00   16   1850
## i want                  -1.00        0.10        10.53   86    187
## from the                -1.99        0.01         3.44  393     19
## it was                  -2.27        0.01         2.34  481     10
## and i                   -2.32        0.00         2.21  485      9
## it is                   -2.37        0.00         2.08  490      8
## and the                 -2.43        0.00         1.94  593      7
## for the                 -2.49        0.00         1.78  608      6
## to be                   -2.57        0.00         1.61  718      5
## on the                  -2.67        0.00         1.41  747      4
## to the                  -2.79        0.00         1.20  902      3
## in the                  -2.97        0.00         0.95 1542      2
## of the                  -3.27        0.00         0.52 1889      1
## [1] "Blogs-Dev Ngram types: 308428 , Total Ngrams: 351929 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                  
## [3] "Top 1 % of ngram types account for 7 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                
##                          log10.col2 cum.%.vocab cum.%.corpus freq   rank
## żywiec archducal brewery       2.00      100.00       100.00    1 308428
## we^ve updated the              1.97       94.29        95.00    1 290832
## scanning exciter lamps         1.85       71.47        75.00    1 220446
## i reminds me                   1.63       42.95        50.00    1 132463
## antics as she                  1.16       14.42        25.00    1  44481
## air and the                    1.00       10.00        21.12    1  30843
## see how much                   0.70        5.00        15.49    2  15421
## it took a                      0.00        1.00         7.04    4   3084
## and when i                    -1.00        0.10         2.12   13    308
## this is the                   -2.00        0.01         0.54   39     31
## this is a                     -2.49        0.00         0.24   67     10
## the rest of                   -2.53        0.00         0.22   67      9
## some of the                   -2.59        0.00         0.20   69      8
## out of the                    -2.64        0.00         0.18   71      7
## it was a                      -2.71        0.00         0.16   71      6
## a couple of                   -2.79        0.00         0.14   77      5
## as well as                    -2.89        0.00         0.12   78      4
## to be a                       -3.01        0.00         0.10   79      3
## a lot of                      -3.19        0.00         0.08  121      2
## one of the                    -3.49        0.00         0.04  154      1
## [1] "Blogs-Dev Ngram types: 336294 , Total Ngrams: 343546 \n" 
## [2] "Vocabulary Ngrams Sorted by Frequency"                   
## [3] "Top 1 % of ngram types account for 2.7 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                 
##                                  log10.col2 cum.%.vocab cum.%.corpus freq   rank
## żywiec archducal brewery perhaps       2.00      100.00       100.00    1 336294
## when asked why i                       1.98       94.89        95.00    1 319117
## stumble upon friend feed               1.87       74.46        75.00    1 250407
## live in and how                        1.69       48.92        50.00    1 164521
## did take over animals                  1.37       23.38        25.00    1  78634
## ante or that such                      1.00       10.00        11.90    1  33629
## after you add your                     0.70        5.00         7.01    1  16815
## or the beginning of                    0.00        1.00         2.67    2   3363
## if you want a                         -1.00        0.10         0.65    4    336
## for the rest of                       -2.00        0.01         0.17   11     34
## when it comes to                      -2.53        0.00         0.07   18     10
## is one of the                         -2.57        0.00         0.07   19      9
## for the first time                    -2.62        0.00         0.06   19      8
## i am going to                         -2.68        0.00         0.06   20      7
## on the other hand                     -2.75        0.00         0.05   21      6
## one of the most                       -2.83        0.00         0.04   23      5
## the end of the                        -2.92        0.00         0.04   29      4
## at the same time                      -3.05        0.00         0.03   30      3
## at the end of                         -3.23        0.00         0.02   34      2
## the rest of the                       -3.53        0.00         0.01   37      1
## [1] "Blogs-Dev Ngram types: 334204 , Total Ngrams: 335516 \n" 
## [2] "Vocabulary Ngrams Sorted by Frequency"                   
## [3] "Top 1 % of ngram types account for 1.4 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                 
##                                      log10.col2 cum.%.vocab cum.%.corpus freq   rank
## żywiec archducal brewery perhaps the       2.00      100.00       100.00    1 334204
## when my name was called                    1.98       94.98        95.00    1 317428
## taking pictures of the championship        1.87       74.90        75.00    1 250325
## makes some sort of sense                   1.70       49.80        50.00    1 166446
## elegance examples of which we              1.39       24.71        25.00    1  82567
## area to hunt the birds                     1.00       10.00        10.35    1  33420
## also no information on the                 0.70        5.00         5.37    1  16710
## a few questions to be                      0.00        1.00         1.39    1   3342
## every minute of it i                      -1.00        0.10         0.28    2    334
## i don^t want to be                        -2.01        0.01         0.06    4     33
## for those of you who                      -2.52        0.00         0.03    6     10
## the rest of the world                     -2.57        0.00         0.02    7      9
## it was going to be                        -2.62        0.00         0.02    7      8
## at the same time i                        -2.68        0.00         0.02    7      7
## the fairmont hotel in kansas              -2.75        0.00         0.02    8      6
## out keep freaking out keep                -2.83        0.00         0.02    8      5
## fairmont hotel in kansas city             -2.92        0.00         0.01    8      4
## keep freaking out keep freaking           -3.05        0.00         0.01    9      3
## freaking out keep freaking out            -3.22        0.00         0.01    9      2
## at the end of the                         -3.52        0.00         0.01   20      1
## [1] "Blogs-Dev Ngram types: 327484 , Total Ngrams: 327810 \n" 
## [2] "Vocabulary Ngrams Sorted by Frequency"                   
## [3] "Top 1 % of ngram types account for 1.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                 
##                                                       log10.col2 cum.%.vocab cum.%.corpus freq   rank
## żywiec archducal brewery perhaps the only                   2.00      100.00       100.00    1 327484
## when necessary to keep an eye                               1.98       94.99        95.00    1 311093
## tasting session in the classroom bakerie^s                  1.87       74.97        75.00    1 245531
## many fruit notes that filled the                            1.70       49.95        50.00    1 163579
## enjoying several rare yellow-billed oxpeckers amongst       1.40       24.93        25.00    1  81626
## art by john harris macmillan audio                          1.00       10.00        10.09    1  32748
## am glad for the day after                                   0.70        5.00         5.09    1  16374
## a great time meals were ok                                  0.00        1.00         1.10    1   3275
## -month-old great-nephew came to live with                  -1.00        0.10         0.20    1    327
## all the light fixtures in the                              -2.00        0.01         0.03    2     33
## there a left wing talking head                             -2.52        0.00         0.02    4     10
## it was going to be a                                       -2.56        0.00         0.02    4      9
## is there a left wing talking                               -2.61        0.00         0.02    4      8
## fairmont hotel in kansas city of                           -2.67        0.00         0.01    4      7
## the grand theater bismarck north dakota                    -2.74        0.00         0.01    5      6
## at the end of the day                                      -2.82        0.00         0.01    5      5
## the fairmont hotel in kansas city                          -2.91        0.00         0.01    8      4
## out keep freaking out keep freaking                        -3.04        0.00         0.01    8      3
## freaking out keep freaking out keep                        -3.21        0.00         0.01    8      2
## keep freaking out keep freaking out                        -3.52        0.00         0.00    9      1
## [1] "Blogs-Dev Ngram types: 320289 , Total Ngrams: 320419 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                  
## [3] "Top 1 % of ngram types account for 1 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                
##                                                       log10.col2
## żywiec archducal brewery perhaps the only one               2.00
## when marty became borough president in one                  1.98
## taxes and pyramids and diamonds and oligarchy               1.88
## many pills and bags of cocaine she                          1.70
## entire country will be hostile to it                        1.40
## artistically-minded baby who wants to decorate the          1.00
## am not married to a cowboy but                              0.70
## a heart shaped cherry in the design                         0.00
## a batch of brainless belgian golden ale                    -1.00
## development events coordinator is responsible for the      -2.00
## a pair of shoes you have to                                -2.51
## a good range of eventualities that can                     -2.55
## a floured surface with a rolling pin                       -2.60
## refunds for price drops of or more                         -2.66
## pins are all of the places i^ve                            -2.73
## the fairmont hotel in kansas city of                       -2.81
## is there a left wing talking head                          -2.90
## out keep freaking out keep freaking out                    -3.03
## keep freaking out keep freaking out keep                   -3.20
## freaking out keep freaking out keep freaking               -3.51
##                                                       cum.%.vocab cum.%.corpus freq   rank
## żywiec archducal brewery perhaps the only one              100.00       100.00    1 320289
## when marty became borough president in one                  95.00        95.00    1 304268
## taxes and pyramids and diamonds and oligarchy               74.99        75.00    1 240184
## many pills and bags of cocaine she                          49.98        50.00    1 160079
## entire country will be hostile to it                        24.97        25.00    1  79975
## artistically-minded baby who wants to decorate the          10.00        10.04    1  32029
## am not married to a cowboy but                               5.00         5.04    1  16014
## a heart shaped cherry in the design                          1.00         1.04    1   3203
## a batch of brainless belgian golden ale                      0.10         0.14    1    320
## development events coordinator is responsible for the        0.01         0.03    2     32
## a pair of shoes you have to                                  0.00         0.01    2     10
## a good range of eventualities that can                       0.00         0.01    2      9
## a floured surface with a rolling pin                         0.00         0.01    2      8
## refunds for price drops of or more                           0.00         0.01    3      7
## pins are all of the places i^ve                              0.00         0.01    3      6
## the fairmont hotel in kansas city of                         0.00         0.01    4      5
## is there a left wing talking head                            0.00         0.01    4      4
## out keep freaking out keep freaking out                      0.00         0.01    8      3
## keep freaking out keep freaking out keep                     0.00         0.00    8      2
## freaking out keep freaking out keep freaking                 0.00         0.00    8      1
## [1] "News-Dev Ngram types: 183857 , Total Ngrams: 319060 \n"   
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 21.4 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                  
##            log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zynga inc        2.00      100.00       100.00    1 183857
## to around        1.96       91.32        95.00    1 167904
## kansas won       1.75       56.62        75.00    1 104092
## sponsor of       1.19       15.44        50.00    2  28382
## better as        1.00       10.00        43.73    2  18386
## shift in         0.70        5.00        35.58    4   9193
## the press        0.21        1.62        25.00    9   2973
## and told         0.00        1.00        21.35   12   1839
## number of       -1.00        0.10         9.25   64    184
## will be         -2.01        0.01         3.38  280     18
## with the        -2.26        0.01         2.61  436     10
## to be           -2.31        0.00         2.47  445      9
## and the         -2.36        0.00         2.33  497      8
## in a            -2.42        0.00         2.17  507      7
## at the          -2.49        0.00         2.01  620      6
## for the         -2.57        0.00         1.82  678      5
## on the          -2.66        0.00         1.61  759      4
## to the          -2.79        0.00         1.37  840      3
## in the          -2.96        0.00         1.11 1678      2
## of the          -3.26        0.00         0.58 1853      1
## [1] "News-Dev Ngram types: 279094 , Total Ngrams: 309157 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                 
## [3] "Top 1 % of ngram types account for 6 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."               
##                      log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zynga inc venture          2.00      100.00       100.00    1 279094
## was willfully blind        1.98       94.46        95.00    1 263636
## scale given an             1.86       72.31        75.00    1 201805
## in saw values              1.65       44.61        50.00    1 124515
## baylor shooter brady       1.23       16.92        25.00    1  47226
## also understand i          1.00       10.00        18.75    1  27909
## way that had               0.70        5.00        14.01    2  13955
## and that^s a               0.00        1.00         6.05    3   2791
## that would be             -1.00        0.10         1.79   11    279
## a couple of               -2.00        0.01         0.46   33     28
## as well as                -2.45        0.00         0.24   51     10
## going to be               -2.49        0.00         0.22   54      9
## the end of                -2.54        0.00         0.20   55      8
## some of the               -2.60        0.00         0.18   55      7
## part of the               -2.67        0.00         0.17   55      6
## in the first              -2.75        0.00         0.15   55      5
## according to the          -2.84        0.00         0.13   64      4
## a lot of                  -2.97        0.00         0.11   96      3
## the u s                   -3.14        0.00         0.08  104      2
## one of the                -3.45        0.00         0.05  140      1
## [1] "News-Dev Ngram types: 294482 , Total Ngrams: 299469 \n"  
## [2] "Vocabulary Ngrams Sorted by Frequency"                   
## [3] "Top 1 % of ngram types account for 2.5 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                 
##                               log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zynga inc venture firms             2.00      100.00       100.00    1 294482
## weighty subject of racial           1.98       94.92        95.00    1 279509
## staple gun and then                 1.87       74.58        75.00    1 219615
## lost her home scores                1.69       49.15        50.00    1 144747
## defined contribution plan but       1.38       23.73        25.00    1  69880
## anywhere at the center              1.00       10.00        11.50    1  29448
## after a -month moratorium           0.70        5.00         6.58    1  14724
## to be with her                      0.00        1.00         2.50    2   2945
## a win over the                     -1.00        0.10         0.59    3    294
## to the u s                         -2.01        0.01         0.16   11     29
## said in a statement                -2.47        0.00         0.08   17     10
## at the university of               -2.51        0.00         0.07   17      9
## at the same time                   -2.57        0.00         0.06   19      8
## in the u s                         -2.62        0.00         0.06   20      7
## a m to p                           -2.69        0.00         0.05   20      6
## we re going to                     -2.77        0.00         0.04   23      5
## m to p m                           -2.87        0.00         0.04   23      4
## at the end of                      -2.99        0.00         0.03   25      3
## for the first time                 -3.17        0.00         0.02   31      2
## the end of the                     -3.47        0.00         0.01   32      1
## [1] "News-Dev Ngram types: 289052 , Total Ngrams: 289991 \n"  
## [2] "Vocabulary Ngrams Sorted by Frequency"                   
## [3] "Top 1 % of ngram types account for 1.3 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                 
##                                  log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zynga inc venture firms doubt          2.00      100.00       100.00    1 289052
## were able to get the                   1.98       94.98        95.00    1 274552
## story set in the tumultuous            1.87       74.92        75.00    1 216554
## market making and trading across       1.70       49.84        50.00    1 144056
## doubts caused by their late            1.39       24.76        25.00    1  71559
## ariz they^ll do so with                1.00       10.00        10.29    1  28905
## all the stops to praise                0.70        5.00         5.31    1  14453
## a draw by majority decision            0.00        1.00         1.32    1   2891
## has been working with the             -1.00        0.10         0.27    2    289
## president of the united states        -2.00        0.01         0.06    4     29
## not be reached for comment            -2.46        0.00         0.03    6     10
## for the first time in                 -2.51        0.00         0.03    6      9
## could not be reached for              -2.56        0.00         0.03    6      8
## at the time of the                    -2.62        0.00         0.03    6      7
## the end of the year                   -2.68        0.00         0.02    7      6
## in the middle of the                  -2.76        0.00         0.02    7      5
## by the end of the                     -2.86        0.00         0.02    8      4
## from a m to p                         -2.98        0.00         0.02   10      3
## at the end of the                     -3.16        0.00         0.01   16      2
## a m to p m                            -3.46        0.00         0.01   20      1
## [1] "News-Dev Ngram types: 280423 , Total Ngrams: 280670 \n"  
## [2] "Vocabulary Ngrams Sorted by Frequency"                   
## [3] "Top 1 % of ngram types account for 1.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                 
##                                           log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zynga inc venture firms doubt that              2.00      100.00       100.00    1 280423
## were approved one program created in            1.98       95.00        95.00    1 266389
## stretches of mostly isolated beach broken       1.87       74.98        75.00    1 210255
## maw breaks the surface grabs the                1.70       49.96        50.00    1 140088
## drug war but with an enhanced                   1.40       24.93        25.00    1  69920
## art silicon valley law foundation child         1.00       10.00        10.08    1  28042
## almost joined the playoff as she                0.70        5.00         5.08    1  14021
## a freelance writer in south russell             0.00        1.00         1.09    1   2804
## -foot- -pound lineman jerry deloach was        -1.00        0.10         0.19    1    280
## aggravated battery to a child causing          -2.00        0.01         0.03    2     28
## at the end of the month                        -2.45        0.00         0.02    3     10
## at the end of the day                          -2.49        0.00         0.02    3      9
## at the beginning of the year                   -2.54        0.00         0.01    3      8
## and a m to p m                                 -2.60        0.00         0.01    3      7
## -year-old resident of the block of             -2.67        0.00         0.01    3      6
## a m to p m saturday                            -2.75        0.00         0.01    4      5
## by the end of the year                         -2.85        0.00         0.01    5      4
## we re going to have to                         -2.97        0.00         0.01    6      3
## could not be reached for comment               -3.15        0.00         0.01    6      2
## from a m to p m                                -3.45        0.00         0.00   10      1
## [1] "News-Dev Ngram types: 271391 , Total Ngrams: 271486 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                 
## [3] "Top 1 % of ngram types account for 1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."               
##                                                    log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zynga inc venture firms doubt that the                   2.00      100.00       100.00    1 271391
## were anxious about the coming budget deliberations       1.98       95.00        95.00    1 257817
## striking down the individual mandate the supreme         1.88       74.99        75.00    1 203519
## may have been involved ray scott reminded                1.70       49.98        50.00    1 135648
## dum maaro dum another ambitious screen effort            1.40       24.97        25.00    1  67776
## artists to increasing its season subscriptions and       1.00       10.00        10.03    1  27139
## along a remote new york beach highway                    0.70        5.00         5.03    1  13570
## a game here at their team dinners                        0.00        1.00         1.03    1   2714
## -on- workouts after surgery to repair torn              -1.00        0.10         0.13    1    271
## fat g saturated mg cholesterol mg sodium                -2.00        0.01         0.02    2     27
## anonymity because he was not authorized to              -2.43        0.00         0.01    2     10
## angry that silva was planning to leave                  -2.48        0.00         0.01    2      9
## and killed her on oct hours before                      -2.53        0.00         0.01    2      8
## and ask her friend how she would                        -2.59        0.00         0.01    2      7
## and a m to p m saturday                                 -2.66        0.00         0.01    2      6
## a -year-old resident of the block of                    -2.73        0.00         0.01    2      5
## the centers for disease control and prevention          -2.83        0.00         0.00    3      4
## protein g carbohydrate g fat g saturated                -2.96        0.00         0.00    3      3
## g protein g carbohydrate g fat g                        -3.13        0.00         0.00    3      2
## calories g protein g carbohydrate g fat                 -3.43        0.00         0.00    3      1
## [1] "Twitter-Dev Ngram types: 147556 , Total Ngrams: 275117 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 23.8 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."                  
##             log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zz^s u            2.00      100.00       100.00    1 147556
## top bloke         1.96       90.68        95.00    1 133800
## is piercing       1.73       53.39        75.00    1  78777
## him once          1.09       12.29        50.00    2  18138
## at sxsw           1.00       10.00        47.54    2  14756
## has more          0.70        5.00        39.38    4   7378
## me my             0.06        1.16        25.00   14   1705
## way of            0.00        1.00        23.78   16   1476
## i like           -1.00        0.10         9.26   83    148
## thank you        -1.99        0.01         2.41  297     15
## if you           -2.17        0.01         1.80  363     10
## i love           -2.21        0.01         1.67  383      9
## you re           -2.27        0.01         1.53  390      8
## thanks for       -2.32        0.00         1.39  426      7
## to the           -2.39        0.00         1.24  434      6
## to be            -2.47        0.00         1.08  461      5
## on the           -2.57        0.00         0.91  461      4
## of the           -2.69        0.00         0.74  531      3
## for the          -2.87        0.00         0.55  739      2
## in the           -3.17        0.00         0.28  776      1
## [1] "Twitter-Dev Ngram types: 220586 , Total Ngrams: 251535 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 7.4 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                  
##                      log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zz^s u no                  2.00      100.00       100.00    1 220586
## where i^m from             1.97       94.30        95.00    1 208009
## scene am tired             1.85       71.49        75.00    1 157702
## ifitellyou one thing       1.63       42.98        50.00    1  94818
## at webbinno on             1.16       14.48        25.00    1  31935
## also tracy needed          1.00       10.00        21.07    1  22059
## that and i                 0.70        5.00        15.73    2  11029
## is at the                  0.00        1.00         7.38    4   2206
## i think we                -1.00        0.10         2.28   14    221
## would love to             -2.00        0.01         0.61   48     22
## i have a                  -2.34        0.00         0.36   60     10
## a lot of                  -2.39        0.00         0.34   63      9
## thank you for             -2.44        0.00         0.31   67      8
## i need to                 -2.50        0.00         0.29   68      7
## can^t wait to             -2.57        0.00         0.26   69      6
## going to be               -2.64        0.00         0.23   74      5
## for the follow            -2.74        0.00         0.20   82      4
## looking forward to        -2.87        0.00         0.17   90      3
## i love you                -3.04        0.00         0.13   91      2
## thanks for the            -3.34        0.00         0.10  248      1
## [1] "Twitter-Dev Ngram types: 223002 , Total Ngrams: 228572 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 2.9 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                  
##                             log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zz^s u no how                     2.00      100.00       100.00    1 223002
## wines for the kids                1.98       94.87        95.00    1 211573
## stop playing the game             1.87       74.38        75.00    1 165859
## liking this author new            1.69       48.75        50.00    1 108716
## east coast devastation hope       1.36       23.13        25.00    1  51573
## at pm tune in                     1.00       10.00        12.19    1  22300
## all cumfied out for               0.70        5.00         7.31    1  11150
## look at the banana                0.00        1.00         2.88    2   2230
## day to all the                   -1.00        0.10         0.76    4    223
## hope to see you                  -2.01        0.01         0.20   12     22
## for the first time               -2.35        0.00         0.13   18     10
## have a great day                 -2.39        0.00         0.12   20      9
## at the same time                 -2.45        0.00         0.11   20      8
## can^t wait to see                -2.50        0.00         0.10   22      7
## thank you so much                -2.57        0.00         0.09   23      6
## the rest of the                  -2.65        0.00         0.08   24      5
## thank you for the                -2.75        0.00         0.07   25      4
## is going to be                   -2.87        0.00         0.06   25      3
## thanks for the rt                -3.05        0.00         0.05   42      2
## thanks for the follow            -3.35        0.00         0.03   70      1
## [1] "Twitter-Dev Ngram types: 205635 , Total Ngrams: 206675 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 1.5 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                  
##                                log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zz^s u no how i                      2.00      100.00       100.00    1 205635
## with a comfortable minutes to        1.98       94.97        95.00    1 195301
## tab and htc evo shift                1.87       74.87        75.00    1 153966
## love to the beastie boys             1.70       49.75        50.00    1 102297
## final page of the book               1.39       24.62        25.00    1  50629
## bakker nor any other minister        1.00       10.00        10.45    1  20564
## an exciting day kindras sencys       0.70        5.00         5.48    1  10282
## a flight with coach and              0.00        1.00         1.50    1   2056
## best one read am on                 -1.00        0.10         0.31    2    206
## at the end of the                   -1.99        0.01         0.06    4     21
## thanks for the shout out            -2.31        0.00         0.03    6     10
## thank you for the rt                -2.36        0.00         0.03    6      9
## let me know when you                -2.41        0.00         0.03    6      8
## it^s going to be a                  -2.47        0.00         0.02    6      7
## for a chance to win                 -2.53        0.00         0.02    6      6
## can^t wait to see you               -2.61        0.00         0.02    6      5
## buzz buzz buzz buzz buzz            -2.71        0.00         0.01    6      4
## what do you think of                -2.84        0.00         0.01    7      3
## in the middle of the                -3.01        0.00         0.01    7      2
## keep up the good work               -3.31        0.00         0.00   10      1
## [1] "Twitter-Dev Ngram types: 185630 , Total Ngrams: 185928 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 1.2 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                  
##                                             log10.col2 cum.%.vocab cum.%.corpus freq   rank
## zz^s u no how i play                              2.00      100.00       100.00    1 185630
## with a new owner your new                         1.98       94.99        95.00    1 176334
## taking all of my notes with                       1.87       74.96        75.00    1 139148
## lunch have been in the studio                     1.70       49.92        50.00    1  92666
## flowing without no stopping he^s sweeter          1.40       24.88        25.00    1  46184
## bava is now an obscure overlord                   1.00       10.00        10.14    1  18563
## and argued w everyone grabbed a                   0.70        5.00         5.15    1   9282
## a great weekend and a great                       0.00        1.00         1.16    1   1856
## s o to all my followers                          -1.00        0.10         0.23    2    186
## hate people i hate people i                      -1.99        0.01         0.03    3     19
## and to washington dc for a                       -2.27        0.01         0.02    3     10
## amazing kisses i^m proud of be                   -2.31        0.00         0.02    3      9
## all you have to do is                            -2.37        0.00         0.02    3      8
## tyrantaylor sign them tyrantaylor sign them      -2.42        0.00         0.02    4      7
## thanks for the follow hope you                   -2.49        0.00         0.01    4      6
## i hate people i hate people                      -2.57        0.00         0.01    4      5
## during class don^t cry during class              -2.67        0.00         0.01    4      4
## don^t cry during class don^t cry                 -2.79        0.00         0.01    4      3
## cry during class don^t cry during                -2.97        0.00         0.00    4      2
## buzz buzz buzz buzz buzz buzz                    -3.27        0.00         0.00    5      1
## [1] "Twitter-Dev Ngram types: 166226 , Total Ngrams: 166389 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"                    
## [3] "Top 1 % of ngram types account for 1.1 % total ngrams.\n" 
## [4] "Many common ngrams contain 'Stopwords'."                  
##                                              log10.col2 cum.%.vocab
## zz^s u no how i play everyman                      2.00      100.00
## with a tallboy pbr or schlitz on                   1.98       95.00
## takes someone special to be a dad                  1.87       74.98
## macaroni in my bra tonight thanks baby             1.70       49.95
## follow please please please it will probably       1.40       24.93
## be a new back to the future                        1.00       10.00
## and become so jaded by life on                     0.70        5.00
## a hell of a series hey losing                      0.00        1.00
## -spread the word if you believe in                -1.00        0.10
## me please i love you su much                      -1.99        0.01
## during class don^t cry during class don^t         -2.22        0.01
## dc for a concert and meet-n-greet austintodc      -2.27        0.01
## class don^t cry during class don^t cry            -2.32        0.00
## change you re amazing kisses i^m proud            -2.38        0.00
## can you follow me please i love                   -2.44        0.00
## and to washington dc for a concert                -2.52        0.00
## amazing kisses i^m proud of be simpsonizer        -2.62        0.00
## don^t cry during class don^t cry during           -2.74        0.00
## cry during class don^t cry during class           -2.92        0.00
## buzz buzz buzz buzz buzz buzz buzz                -3.22        0.00
##                                              cum.%.corpus freq   rank
## zz^s u no how i play everyman                      100.00    1 166226
## with a tallboy pbr or schlitz on                    95.00    1 157907
## takes someone special to be a dad                   75.00    1 124629
## macaroni in my bra tonight thanks baby              50.00    1  83031
## follow please please please it will probably        25.00    1  41434
## be a new back to the future                         10.09    1  16623
## and become so jaded by life on                       5.09    1   8311
## a hell of a series hey losing                        1.10    1   1662
## -spread the word if you believe in                   0.20    1    166
## me please i love you su much                         0.03    3     17
## during class don^t cry during class don^t            0.02    3     10
## dc for a concert and meet-n-greet austintodc         0.02    3      9
## class don^t cry during class don^t cry               0.02    3      8
## change you re amazing kisses i^m proud               0.01    3      7
## can you follow me please i love                      0.01    3      6
## and to washington dc for a concert                   0.01    3      5
## amazing kisses i^m proud of be simpsonizer           0.01    3      4
## don^t cry during class don^t cry during              0.01    4      3
## cry during class don^t cry during class              0.00    4      2
## buzz buzz buzz buzz buzz buzz buzz                   0.00    4      1
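
The columns in the tables above (freq, rank, cum.%.vocab, cum.%.corpus and the log10 of cum.%.vocab) can be reproduced with a few lines of base R. The following is a minimal sketch under stated assumptions, not the code that produced this report: the function name `coverage_table`, its column names and the toy input are hypothetical, and it assumes the ngrams have already been extracted into a character vector with one element per ngram occurrence.

```r
# Hypothetical helper (not from the original report): builds a frequency
# table for a character vector of ngram tokens, ranked so that rank 1 is
# the most frequent ngram type, as in the tables above.
coverage_table <- function(ngrams) {
  counts  <- sort(table(ngrams), decreasing = TRUE)  # most frequent type first
  freq    <- as.integer(counts)
  n_types <- length(freq)                            # number of distinct ngram types
  rank    <- seq_len(n_types)

  cum_pct_vocab <- 100 * rank / n_types              # share of ngram types covered so far

  data.frame(
    ngram          = names(counts),
    log10_vocab    = log10(cum_pct_vocab),           # log base 10 of cum.%.vocab
    cum_pct_vocab  = cum_pct_vocab,
    cum_pct_corpus = 100 * cumsum(freq) / sum(freq), # share of all ngram occurrences covered
    freq           = freq,
    rank           = rank,
    stringsAsFactors = FALSE
  )
}

# Toy input: six bigram occurrences, three distinct bigram types.
toy <- c("a b", "a b", "a b", "c d", "e f", "e f")
tab <- coverage_table(toy)
print(tab)
```

With this definition, the "Top 1 % of ngram types" line corresponds to the value of `cum_pct_corpus` in the row where `cum_pct_vocab` reaches 1. In the 7-gram table above that row has rank 1662 out of 166226 types and covers 1.10 % of the ngrams, because almost every 7-gram occurs only once.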