This document outlines my strategy for implementing a next-word predictor “app” that takes a phrase as input and suggests a likely next word. Natural language processing is new to me, so I reviewed the topic on Wikipedia, Coursera, the R documentation, Google and YouTube. The corpus of text data for model development was provided on the course web site. It comes from three sources: blogs, news and Twitter. See Appendix 1 for basic statistics.
I reviewed the input data files and decided to search them for answers to the quiz questions. I extracted the last 10 characters from each incomplete phrase provided in quiz 2 and compiled a list of all matching blog documents. Then I extracted the next word from each matching document and tabulated the next-word frequencies. Finally, I printed the four most common next words for each question. This approach was partially successful. One of the four most common words matched a quiz answer choice in 4 out of 10 questions, but the single most frequent word matched a choice in only 2 out of 10 questions; these two involved the common tri-grams “case of beer” and “quite some time”. This was encouraging, especially since I only processed the blogs, and the news and Twitter data would presumably improve the results. See Appendix 1 for a table of the search results.
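The search described above could be sketched roughly as follows. This is a Python illustration rather than the R code used for this report; the function name `next_word_counts`, the regular expression and the toy documents are all assumptions made for the example.

```python
import re
from collections import Counter

def next_word_counts(documents, phrase, suffix_len=10):
    """Count the words that follow the last `suffix_len` characters of `phrase`."""
    suffix = phrase[-suffix_len:].lower()
    # Match the suffix, then whitespace, then capture the word that follows.
    pattern = re.compile(re.escape(suffix) + r"\s+([a-z']+)")
    counts = Counter()
    for doc in documents:
        counts.update(pattern.findall(doc.lower()))
    return counts

# Toy documents standing in for the blogs corpus (hypothetical data).
docs = [
    "He bought a case of beer for the party.",
    "We shared a case of beer last night.",
    "It was a case of mistaken identity.",
]
counts = next_word_counts(docs, "and I'd buy a case of")
top_four = counts.most_common(4)  # the four most common next words
print(top_four)
```

Matching on a fixed-length character suffix is simple, but it can also match in the middle of a longer word; a token-based match would be stricter.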
The corpus was randomly divided into sets for development, training, validation and testing (1, 60, 20 and 20 percent respectively). The development dataset is a subset of the training dataset.
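A split of this kind might look like the sketch below, written in Python for illustration only. The function name `split_corpus`, the fixed seed, and the choice to size the development set as 1% of the whole corpus (rather than 1% of the training set) are assumptions, not details taken from the report.

```python
import random

def split_corpus(lines, seed=42):
    """Shuffle line indices, split them 60/20/20, and draw a dev subset from training."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(lines)))
    rng.shuffle(indices)
    n = len(indices)
    n_train = n * 60 // 100
    n_valid = n * 20 // 100
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    # Assumption: the development set is 1% of the corpus, taken from training.
    dev = train[:max(1, n // 100)]
    return {"dev": dev, "train": train, "valid": valid, "test": test}

lines = [f"document {i}" for i in range(1000)]  # placeholder corpus
parts = split_corpus(lines)
print({name: len(idx) for name, idx in parts.items()})
```

Splitting by index rather than copying the text keeps the sets cheap to store and easy to check for overlap.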
Using the development dataset, I printed frequency plots and tables (instead of histograms) for word types and ngram types. A “word type” is simply a word in the vocabulary: e.g. ‘the’ is the most common word type. An ngram type is simply a distinct ngram: e.g. ‘one of the’ was the most common tri-gram in the blogs and news corpora, whereas ‘thanks for the’ was the most common tri-gram in the Twitter corpus. Word types and ngram types were ranked (sorted) by frequency within each corpus (blogs, news or Twitter). Each plot corresponds to the table that follows it, so you can compare the shape of the curve in the plot to the numerical values in the table. They illustrate that the most common word types cover a very large percentage of the corpus: 1% of the word types covered up to 60% of the corpus. For bi-grams, 1% of the types covered up to 25% of the corpus. Consistent with Zipf’s law, the plots illustrate that the “log base 10 of the cumulative percent of word types” is nearly proportional (approximately a straight line) to the percent of the corpus that they cover. “Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.” (http://en.wikipedia.org/wiki/Zipf%27s_law). When stop words are removed, the coverage is much lower (in the tables below, 1% of the word types then covers roughly 23% to 32% of the corpus) and does not follow Zipf’s law.
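The columns in the tables below (rank, frequency, cumulative percent of types, cumulative percent of the corpus, and the log base 10 of the cumulative percent of types) can be computed along the following lines. This is a Python sketch under my own naming (`coverage_table`, `ngrams` and the dictionary keys are hypothetical), not the R code that produced the tables.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage_table(items):
    """Rank types by frequency and report cumulative vocabulary and corpus coverage."""
    counts = Counter(items)
    ranked = counts.most_common()  # most frequent type first
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (item, freq) in enumerate(ranked, start=1):
        running += freq
        cum_vocab = 100.0 * rank / len(ranked)
        rows.append({
            "type": item,
            "rank": rank,
            "freq": freq,
            "cum_pct_vocab": cum_vocab,
            "cum_pct_corpus": 100.0 * running / total,
            "log10_cum_pct_vocab": math.log10(cum_vocab),
        })
    return rows

tokens = "the cat sat on the mat and the dog sat on the rug".split()
rows = coverage_table(tokens)                    # word types
bigram_rows = coverage_table(ngrams(tokens, 2))  # bi-gram types
for row in rows[:3]:
    print(row)
```

Plotting `log10_cum_pct_vocab` against `cum_pct_corpus` gives the kind of curve described above, and passing `ngrams(tokens, n)` instead of `tokens` gives the equivalent table for n-grams.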
I would like to develop a set of features from the corpus. Each feature will have an associated model that outputs several next word predictions and probabilities. The set of predicted next words and probabilities will be given to a selection model (e.g. random forest) that selects the most likely next word. If I have time and resources, the set of features could include associations and frequencies for words and ngrams, word and ngram similarity, part of speech and sentiment.
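As one illustration of a per-feature model that outputs several next-word predictions with probabilities, the sketch below uses plain n-gram frequencies. It is written in Python for illustration; the function names, the maximum-likelihood probabilities and the toy sentences are assumptions, and the selection model that would combine several such outputs is not shown.

```python
from collections import Counter, defaultdict

def train_ngram_model(sentences, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it (n >= 2)."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, n=3, k=3):
    """Return up to k (word, probability) pairs for the phrase's last n-1 words."""
    context = tuple(phrase.lower().split()[-(n - 1):])
    followers = model.get(context)
    if not followers:
        return []  # unseen context: this feature offers no prediction
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]

# Toy training sentences (hypothetical data).
sentences = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "one of the best",
]
model = train_ngram_model(sentences)
predictions = predict_next(model, "Thanks for the")
print(predictions)
```

Returning an empty list for unseen contexts leaves room for other features, or a shorter n-gram, to supply candidates instead.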
##
## Basic Statistics
## file lineCount meanLength maxLength totalNchar
## 1 ./final/en_US/en_US.twitter.txt 2360148 68.68045 140 162096031
## 2 ./final/en_US/en_US.blogs.txt 899288 229.987 40833 206824505
## 3 ./final/en_US/en_US.news.txt 1010242 201.1628 11384 203223159
## 4 ./final/en_US/sample.en_US.twitter.txt 2423 68.21461 140 165284
## 5 ./final/en_US/sample.en_US.blogs.txt 895 227.6615 1972 203757
## 6 ./final/en_US/sample.en_US.news.txt 1050 199.2533 884 209216
##
## Search Results
## cgramText cgramCount nw1Text nw1Count nw2Text nw2Count nw3Text
## 1 a case of 251 beer 8 making 3 much
## 2 d mean the 31 difference 6 world 5 death
## 3 ake me the 25 ages 1 agi 1 best
## 4 ng but the 290 truth 8 best 7 kitchen
## 5 ate at the 141 end 15 same 9 time
## 6 d be on my 10 blog 1 family 1 feet
## 7 quite some 305 time 247 since 12 thing
## 8 his little 1488 girl 52 guy 44 blog
## 9 during the 5668 day 286 week 191 summer
## 10 ou must be 234 able 19 follower 8 wondering
## xLeft xRight y freq rank
## zzzzz 2.00 100.00 100.00 1 29481
## paused 1.63 42.53 95.00 2 12537
## absolute 1.00 10.00 83.90 11 2948
## therefore 0.61 4.06 75.00 32 1196
## quite 0.00 1.00 60.89 127 295
## here -0.44 0.36 49.95 431 107
## from -1.01 0.10 34.26 1451 29
## you -1.39 0.04 24.65 3026 12
## it -1.47 0.03 22.86 3983 10
## is -1.52 0.03 21.78 4225 9
## that -1.57 0.03 20.63 4566 8
## in -1.62 0.02 19.40 5998 7
## i -1.69 0.02 17.78 7759 6
## of -1.77 0.02 15.68 8755 5
## a -1.87 0.01 13.31 8938 4
## to -1.99 0.01 10.89 10582 3
## and -2.17 0.01 8.02 10928 2
## the -2.47 0.00 5.06 18712 1
## xLeft xRight y freq rank
## zyprexa 2.00 100.00 100.00 1 30628
## unveil 1.67 47.15 95.00 2 14441
## kelley 1.00 10.00 80.94 12 3063
## dangerous 0.78 6.00 75.00 21 1839
## keep 0.00 1.00 54.59 111 306
## night -0.20 0.63 50.01 166 194
## who -0.99 0.10 32.10 1089 31
## at -1.31 0.05 25.32 2108 15
## on -1.49 0.03 21.81 2621 10
## is -1.53 0.03 21.02 2776 9
## that -1.58 0.03 20.17 3289 8
## for -1.64 0.02 19.17 3474 7
## in -1.71 0.02 18.12 6638 6
## of -1.79 0.02 16.10 7574 5
## and -1.88 0.01 13.80 8549 4
## a -2.01 0.01 11.20 8622 3
## to -2.19 0.01 8.58 8875 2
## the -2.49 0.00 5.88 19347 1
## xLeft xRight y freq rank
## zz^s 2.00 100.00 100.00 1 25394
## valid 1.62 41.97 95.00 2 10658
## candidate 1.00 10.00 84.60 10 2539
## heat 0.57 3.72 75.00 33 945
## song 0.00 1.00 59.47 151 254
## first -0.31 0.49 50.00 315 124
## we -1.01 0.10 28.56 1505 25
## your -1.15 0.07 24.79 1733 18
## of -1.40 0.04 18.64 3553 10
## is -1.45 0.04 17.45 3593 9
## in -1.50 0.03 16.25 3772 8
## for -1.56 0.03 14.99 3846 7
## and -1.63 0.02 13.70 4507 6
## you -1.71 0.02 12.19 5787 5
## a -1.80 0.02 10.25 6083 4
## i -1.93 0.01 8.22 7186 3
## to -2.10 0.01 5.81 7987 2
## the -2.40 0.00 3.14 9374 1
## xLeft xRight y freq rank
## zzzzz 2.00 100.00 100.00 1 28939
## lone 1.87 74.05 95.00 1 21430
## michael^s 1.27 18.65 75.00 5 5396
## pen 1.00 10.00 63.52 10 2894
## heads 0.69 4.94 50.00 20 1431
## yesterday 0.02 1.05 24.98 63 305
## nature 0.00 1.00 24.30 65 289
## give -1.00 0.10 6.60 206 29
## work -1.46 0.03 3.38 379 10
## life -1.51 0.03 3.13 380 9
## made -1.56 0.03 2.87 401 8
## love -1.62 0.02 2.61 456 7
## day -1.68 0.02 2.30 474 6
## good -1.76 0.02 1.99 489 5
## back -1.86 0.01 1.66 522 4
## make -1.98 0.01 1.31 542 3
## people -2.16 0.01 0.95 581 2
## time -2.46 0.00 0.57 850 1
## xLeft xRight y freq rank
## zyprexa 2.00 100.00 100.00 1 30103
## lenovo 1.87 73.44 95.00 1 22107
## frustration 1.26 18.36 75.00 5 5527
## bombs 1.00 10.00 63.71 10 3010
## split 0.70 5.02 50.00 21 1511
## important 0.06 1.14 25.00 63 343
## force 0.00 1.00 23.28 68 301
## including -1.00 0.10 5.78 201 30
## make -1.48 0.03 2.62 312 10
## percent -1.52 0.03 2.43 338 9
## back -1.58 0.03 2.22 338 8
## school -1.63 0.02 2.00 344 7
## city -1.70 0.02 1.79 382 6
## years -1.78 0.02 1.55 458 5
## state -1.88 0.01 1.26 470 4
## people -2.00 0.01 0.97 477 3
## time -2.18 0.01 0.67 510 2
## year -2.48 0.00 0.35 564 1
## xLeft xRight y freq rank
## zz^s 2.00 100.00 100.00 1 24871
## melus 1.87 73.62 95.00 1 18311
## champion 1.19 15.42 75.00 5 3834
## ouch 1.00 10.00 67.96 9 2487
## david 0.54 3.45 50.00 25 859
## problem 0.00 1.00 31.52 71 249
## crazy -0.23 0.60 24.98 104 148
## year -1.00 0.10 10.29 273 25
## people -1.40 0.04 6.10 526 10
## back -1.44 0.04 5.70 579 9
## lol -1.49 0.03 5.26 717 8
## great -1.55 0.03 4.71 731 7
## today -1.62 0.02 4.16 750 6
## time -1.70 0.02 3.59 774 5
## rt -1.79 0.02 3.00 884 4
## day -1.92 0.01 2.32 930 3
## good -2.09 0.01 1.61 1023 2
## love -2.40 0.00 0.83 1093 1
## [1] "Blogs-Dev Ngram types: 186995 , Total Ngrams: 360610 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 25.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## żywiec archducal 2.00 100.00 100.00 1 186995
## through manuscript 1.96 90.36 95.00 1 168964
## house well 1.71 51.79 75.00 1 96842
## but how 1.05 11.29 50.00 2 21107
## actions of 1.00 10.00 48.67 2 18700
## corn syrup 0.70 5.00 40.33 4 9350
## about his 0.00 1.00 25.09 15 1870
## unable to 0.00 0.99 25.00 16 1850
## i want -1.00 0.10 10.53 86 187
## from the -1.99 0.01 3.44 393 19
## it was -2.27 0.01 2.34 481 10
## and i -2.32 0.00 2.21 485 9
## it is -2.37 0.00 2.08 490 8
## and the -2.43 0.00 1.94 593 7
## for the -2.49 0.00 1.78 608 6
## to be -2.57 0.00 1.61 718 5
## on the -2.67 0.00 1.41 747 4
## to the -2.79 0.00 1.20 902 3
## in the -2.97 0.00 0.95 1542 2
## of the -3.27 0.00 0.52 1889 1
## [1] "Blogs-Dev Ngram types: 308428 , Total Ngrams: 351929 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 7 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## żywiec archducal brewery 2.00 100.00 100.00 1 308428
## we^ve updated the 1.97 94.29 95.00 1 290832
## scanning exciter lamps 1.85 71.47 75.00 1 220446
## i reminds me 1.63 42.95 50.00 1 132463
## antics as she 1.16 14.42 25.00 1 44481
## air and the 1.00 10.00 21.12 1 30843
## see how much 0.70 5.00 15.49 2 15421
## it took a 0.00 1.00 7.04 4 3084
## and when i -1.00 0.10 2.12 13 308
## this is the -2.00 0.01 0.54 39 31
## this is a -2.49 0.00 0.24 67 10
## the rest of -2.53 0.00 0.22 67 9
## some of the -2.59 0.00 0.20 69 8
## out of the -2.64 0.00 0.18 71 7
## it was a -2.71 0.00 0.16 71 6
## a couple of -2.79 0.00 0.14 77 5
## as well as -2.89 0.00 0.12 78 4
## to be a -3.01 0.00 0.10 79 3
## a lot of -3.19 0.00 0.08 121 2
## one of the -3.49 0.00 0.04 154 1
## [1] "Blogs-Dev Ngram types: 336294 , Total Ngrams: 343546 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 2.7 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## żywiec archducal brewery perhaps 2.00 100.00 100.00 1 336294
## when asked why i 1.98 94.89 95.00 1 319117
## stumble upon friend feed 1.87 74.46 75.00 1 250407
## live in and how 1.69 48.92 50.00 1 164521
## did take over animals 1.37 23.38 25.00 1 78634
## ante or that such 1.00 10.00 11.90 1 33629
## after you add your 0.70 5.00 7.01 1 16815
## or the beginning of 0.00 1.00 2.67 2 3363
## if you want a -1.00 0.10 0.65 4 336
## for the rest of -2.00 0.01 0.17 11 34
## when it comes to -2.53 0.00 0.07 18 10
## is one of the -2.57 0.00 0.07 19 9
## for the first time -2.62 0.00 0.06 19 8
## i am going to -2.68 0.00 0.06 20 7
## on the other hand -2.75 0.00 0.05 21 6
## one of the most -2.83 0.00 0.04 23 5
## the end of the -2.92 0.00 0.04 29 4
## at the same time -3.05 0.00 0.03 30 3
## at the end of -3.23 0.00 0.02 34 2
## the rest of the -3.53 0.00 0.01 37 1
## [1] "Blogs-Dev Ngram types: 334204 , Total Ngrams: 335516 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.4 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## żywiec archducal brewery perhaps the 2.00 100.00 100.00 1 334204
## when my name was called 1.98 94.98 95.00 1 317428
## taking pictures of the championship 1.87 74.90 75.00 1 250325
## makes some sort of sense 1.70 49.80 50.00 1 166446
## elegance examples of which we 1.39 24.71 25.00 1 82567
## area to hunt the birds 1.00 10.00 10.35 1 33420
## also no information on the 0.70 5.00 5.37 1 16710
## a few questions to be 0.00 1.00 1.39 1 3342
## every minute of it i -1.00 0.10 0.28 2 334
## i don^t want to be -2.01 0.01 0.06 4 33
## for those of you who -2.52 0.00 0.03 6 10
## the rest of the world -2.57 0.00 0.02 7 9
## it was going to be -2.62 0.00 0.02 7 8
## at the same time i -2.68 0.00 0.02 7 7
## the fairmont hotel in kansas -2.75 0.00 0.02 8 6
## out keep freaking out keep -2.83 0.00 0.02 8 5
## fairmont hotel in kansas city -2.92 0.00 0.01 8 4
## keep freaking out keep freaking -3.05 0.00 0.01 9 3
## freaking out keep freaking out -3.22 0.00 0.01 9 2
## at the end of the -3.52 0.00 0.01 20 1
## [1] "Blogs-Dev Ngram types: 327484 , Total Ngrams: 327810 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## żywiec archducal brewery perhaps the only 2.00 100.00 100.00 1 327484
## when necessary to keep an eye 1.98 94.99 95.00 1 311093
## tasting session in the classroom bakerie^s 1.87 74.97 75.00 1 245531
## many fruit notes that filled the 1.70 49.95 50.00 1 163579
## enjoying several rare yellow-billed oxpeckers amongst 1.40 24.93 25.00 1 81626
## art by john harris macmillan audio 1.00 10.00 10.09 1 32748
## am glad for the day after 0.70 5.00 5.09 1 16374
## a great time meals were ok 0.00 1.00 1.10 1 3275
## -month-old great-nephew came to live with -1.00 0.10 0.20 1 327
## all the light fixtures in the -2.00 0.01 0.03 2 33
## there a left wing talking head -2.52 0.00 0.02 4 10
## it was going to be a -2.56 0.00 0.02 4 9
## is there a left wing talking -2.61 0.00 0.02 4 8
## fairmont hotel in kansas city of -2.67 0.00 0.01 4 7
## the grand theater bismarck north dakota -2.74 0.00 0.01 5 6
## at the end of the day -2.82 0.00 0.01 5 5
## the fairmont hotel in kansas city -2.91 0.00 0.01 8 4
## out keep freaking out keep freaking -3.04 0.00 0.01 8 3
## freaking out keep freaking out keep -3.21 0.00 0.01 8 2
## keep freaking out keep freaking out -3.52 0.00 0.00 9 1
## [1] "Blogs-Dev Ngram types: 320289 , Total Ngrams: 320419 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## żywiec archducal brewery perhaps the only one 2.00 100.00 100.00 1 320289
## when marty became borough president in one 1.98 95.00 95.00 1 304268
## taxes and pyramids and diamonds and oligarchy 1.88 74.99 75.00 1 240184
## many pills and bags of cocaine she 1.70 49.98 50.00 1 160079
## entire country will be hostile to it 1.40 24.97 25.00 1 79975
## artistically-minded baby who wants to decorate the 1.00 10.00 10.04 1 32029
## am not married to a cowboy but 0.70 5.00 5.04 1 16014
## a heart shaped cherry in the design 0.00 1.00 1.04 1 3203
## a batch of brainless belgian golden ale -1.00 0.10 0.14 1 320
## development events coordinator is responsible for the -2.00 0.01 0.03 2 32
## a pair of shoes you have to -2.51 0.00 0.01 2 10
## a good range of eventualities that can -2.55 0.00 0.01 2 9
## a floured surface with a rolling pin -2.60 0.00 0.01 2 8
## refunds for price drops of or more -2.66 0.00 0.01 3 7
## pins are all of the places i^ve -2.73 0.00 0.01 3 6
## the fairmont hotel in kansas city of -2.81 0.00 0.01 4 5
## is there a left wing talking head -2.90 0.00 0.01 4 4
## out keep freaking out keep freaking out -3.03 0.00 0.01 8 3
## keep freaking out keep freaking out keep -3.20 0.00 0.00 8 2
## freaking out keep freaking out keep freaking -3.51 0.00 0.00 8 1
## [1] "News-Dev Ngram types: 183857 , Total Ngrams: 319060 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 21.4 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zynga inc 2.00 100.00 100.00 1 183857
## to around 1.96 91.32 95.00 1 167904
## kansas won 1.75 56.62 75.00 1 104092
## sponsor of 1.19 15.44 50.00 2 28382
## better as 1.00 10.00 43.73 2 18386
## shift in 0.70 5.00 35.58 4 9193
## the press 0.21 1.62 25.00 9 2973
## and told 0.00 1.00 21.35 12 1839
## number of -1.00 0.10 9.25 64 184
## will be -2.01 0.01 3.38 280 18
## with the -2.26 0.01 2.61 436 10
## to be -2.31 0.00 2.47 445 9
## and the -2.36 0.00 2.33 497 8
## in a -2.42 0.00 2.17 507 7
## at the -2.49 0.00 2.01 620 6
## for the -2.57 0.00 1.82 678 5
## on the -2.66 0.00 1.61 759 4
## to the -2.79 0.00 1.37 840 3
## in the -2.96 0.00 1.11 1678 2
## of the -3.26 0.00 0.58 1853 1
## [1] "News-Dev Ngram types: 279094 , Total Ngrams: 309157 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 6 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zynga inc venture 2.00 100.00 100.00 1 279094
## was willfully blind 1.98 94.46 95.00 1 263636
## scale given an 1.86 72.31 75.00 1 201805
## in saw values 1.65 44.61 50.00 1 124515
## baylor shooter brady 1.23 16.92 25.00 1 47226
## also understand i 1.00 10.00 18.75 1 27909
## way that had 0.70 5.00 14.01 2 13955
## and that^s a 0.00 1.00 6.05 3 2791
## that would be -1.00 0.10 1.79 11 279
## a couple of -2.00 0.01 0.46 33 28
## as well as -2.45 0.00 0.24 51 10
## going to be -2.49 0.00 0.22 54 9
## the end of -2.54 0.00 0.20 55 8
## some of the -2.60 0.00 0.18 55 7
## part of the -2.67 0.00 0.17 55 6
## in the first -2.75 0.00 0.15 55 5
## according to the -2.84 0.00 0.13 64 4
## a lot of -2.97 0.00 0.11 96 3
## the u s -3.14 0.00 0.08 104 2
## one of the -3.45 0.00 0.05 140 1
## [1] "News-Dev Ngram types: 294482 , Total Ngrams: 299469 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 2.5 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zynga inc venture firms 2.00 100.00 100.00 1 294482
## weighty subject of racial 1.98 94.92 95.00 1 279509
## staple gun and then 1.87 74.58 75.00 1 219615
## lost her home scores 1.69 49.15 50.00 1 144747
## defined contribution plan but 1.38 23.73 25.00 1 69880
## anywhere at the center 1.00 10.00 11.50 1 29448
## after a -month moratorium 0.70 5.00 6.58 1 14724
## to be with her 0.00 1.00 2.50 2 2945
## a win over the -1.00 0.10 0.59 3 294
## to the u s -2.01 0.01 0.16 11 29
## said in a statement -2.47 0.00 0.08 17 10
## at the university of -2.51 0.00 0.07 17 9
## at the same time -2.57 0.00 0.06 19 8
## in the u s -2.62 0.00 0.06 20 7
## a m to p -2.69 0.00 0.05 20 6
## we re going to -2.77 0.00 0.04 23 5
## m to p m -2.87 0.00 0.04 23 4
## at the end of -2.99 0.00 0.03 25 3
## for the first time -3.17 0.00 0.02 31 2
## the end of the -3.47 0.00 0.01 32 1
## [1] "News-Dev Ngram types: 289052 , Total Ngrams: 289991 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.3 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zynga inc venture firms doubt 2.00 100.00 100.00 1 289052
## were able to get the 1.98 94.98 95.00 1 274552
## story set in the tumultuous 1.87 74.92 75.00 1 216554
## market making and trading across 1.70 49.84 50.00 1 144056
## doubts caused by their late 1.39 24.76 25.00 1 71559
## ariz they^ll do so with 1.00 10.00 10.29 1 28905
## all the stops to praise 0.70 5.00 5.31 1 14453
## a draw by majority decision 0.00 1.00 1.32 1 2891
## has been working with the -1.00 0.10 0.27 2 289
## president of the united states -2.00 0.01 0.06 4 29
## not be reached for comment -2.46 0.00 0.03 6 10
## for the first time in -2.51 0.00 0.03 6 9
## could not be reached for -2.56 0.00 0.03 6 8
## at the time of the -2.62 0.00 0.03 6 7
## the end of the year -2.68 0.00 0.02 7 6
## in the middle of the -2.76 0.00 0.02 7 5
## by the end of the -2.86 0.00 0.02 8 4
## from a m to p -2.98 0.00 0.02 10 3
## at the end of the -3.16 0.00 0.01 16 2
## a m to p m -3.46 0.00 0.01 20 1
## [1] "News-Dev Ngram types: 280423 , Total Ngrams: 280670 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zynga inc venture firms doubt that 2.00 100.00 100.00 1 280423
## were approved one program created in 1.98 95.00 95.00 1 266389
## stretches of mostly isolated beach broken 1.87 74.98 75.00 1 210255
## maw breaks the surface grabs the 1.70 49.96 50.00 1 140088
## drug war but with an enhanced 1.40 24.93 25.00 1 69920
## art silicon valley law foundation child 1.00 10.00 10.08 1 28042
## almost joined the playoff as she 0.70 5.00 5.08 1 14021
## a freelance writer in south russell 0.00 1.00 1.09 1 2804
## -foot- -pound lineman jerry deloach was -1.00 0.10 0.19 1 280
## aggravated battery to a child causing -2.00 0.01 0.03 2 28
## at the end of the month -2.45 0.00 0.02 3 10
## at the end of the day -2.49 0.00 0.02 3 9
## at the beginning of the year -2.54 0.00 0.01 3 8
## and a m to p m -2.60 0.00 0.01 3 7
## -year-old resident of the block of -2.67 0.00 0.01 3 6
## a m to p m saturday -2.75 0.00 0.01 4 5
## by the end of the year -2.85 0.00 0.01 5 4
## we re going to have to -2.97 0.00 0.01 6 3
## could not be reached for comment -3.15 0.00 0.01 6 2
## from a m to p m -3.45 0.00 0.00 10 1
## [1] "News-Dev Ngram types: 271391 , Total Ngrams: 271486 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zynga inc venture firms doubt that the 2.00 100.00 100.00 1 271391
## were anxious about the coming budget deliberations 1.98 95.00 95.00 1 257817
## striking down the individual mandate the supreme 1.88 74.99 75.00 1 203519
## may have been involved ray scott reminded 1.70 49.98 50.00 1 135648
## dum maaro dum another ambitious screen effort 1.40 24.97 25.00 1 67776
## artists to increasing its season subscriptions and 1.00 10.00 10.03 1 27139
## along a remote new york beach highway 0.70 5.00 5.03 1 13570
## a game here at their team dinners 0.00 1.00 1.03 1 2714
## -on- workouts after surgery to repair torn -1.00 0.10 0.13 1 271
## fat g saturated mg cholesterol mg sodium -2.00 0.01 0.02 2 27
## anonymity because he was not authorized to -2.43 0.00 0.01 2 10
## angry that silva was planning to leave -2.48 0.00 0.01 2 9
## and killed her on oct hours before -2.53 0.00 0.01 2 8
## and ask her friend how she would -2.59 0.00 0.01 2 7
## and a m to p m saturday -2.66 0.00 0.01 2 6
## a -year-old resident of the block of -2.73 0.00 0.01 2 5
## the centers for disease control and prevention -2.83 0.00 0.00 3 4
## protein g carbohydrate g fat g saturated -2.96 0.00 0.00 3 3
## g protein g carbohydrate g fat g -3.13 0.00 0.00 3 2
## calories g protein g carbohydrate g fat -3.43 0.00 0.00 3 1
## [1] "Twitter-Dev Ngram types: 147556 , Total Ngrams: 275117 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 23.8 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zz^s u 2.00 100.00 100.00 1 147556
## top bloke 1.96 90.68 95.00 1 133800
## is piercing 1.73 53.39 75.00 1 78777
## him once 1.09 12.29 50.00 2 18138
## at sxsw 1.00 10.00 47.54 2 14756
## has more 0.70 5.00 39.38 4 7378
## me my 0.06 1.16 25.00 14 1705
## way of 0.00 1.00 23.78 16 1476
## i like -1.00 0.10 9.26 83 148
## thank you -1.99 0.01 2.41 297 15
## if you -2.17 0.01 1.80 363 10
## i love -2.21 0.01 1.67 383 9
## you re -2.27 0.01 1.53 390 8
## thanks for -2.32 0.00 1.39 426 7
## to the -2.39 0.00 1.24 434 6
## to be -2.47 0.00 1.08 461 5
## on the -2.57 0.00 0.91 461 4
## of the -2.69 0.00 0.74 531 3
## for the -2.87 0.00 0.55 739 2
## in the -3.17 0.00 0.28 776 1
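The columns in these tables can be reproduced from raw n-gram counts. Reading the bi-gram table above, `cum.%.vocab` appears to be `rank` divided by the number of n-gram types, `cum.%.corpus` the running total of `freq` divided by the total n-gram count, and `log10.col2` the base-10 logarithm of `cum.%.vocab`. For example, "in the" gives 776 / 275117 = 0.28% of the corpus and log10(100 × 1 / 147556) = -3.17. The following is only a minimal sketch of that arithmetic, not the code that produced this report: the function name `ngram_coverage`, the dot-separated column names and the toy counts are all assumptions made for illustration.

```r
# Hypothetical helper (not from the original report): rebuild the summary
# columns from a named vector of n-gram counts.
ngram_coverage <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)            # rank 1 = most frequent type
  rank <- seq_along(counts)
  cum_pct_vocab <- 100 * rank / length(counts)         # share of n-gram types so far
  cum_pct_corpus <- 100 * cumsum(counts) / sum(counts) # share of all n-grams covered
  data.frame(
    log10.col2     = round(log10(cum_pct_vocab), 2),
    cum.pct.vocab  = round(cum_pct_vocab, 2),
    cum.pct.corpus = round(unname(cum_pct_corpus), 2),
    freq           = unname(counts),
    rank           = rank,
    row.names      = names(counts)
  )
}

# Toy counts, invented for illustration only.
toy <- c("in the" = 5, "for the" = 3, "of the" = 1, "top bloke" = 1)
coverage <- ngram_coverage(toy)
print(coverage)
```

The "Top 1 % of ngram types" summary line would then correspond to the `cum.pct.corpus` value in the row where `cum.pct.vocab` reaches 1.00.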
## [1] "Twitter-Dev Ngram types: 220586 , Total Ngrams: 251535 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 7.4 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zz^s u no 2.00 100.00 100.00 1 220586
## where i^m from 1.97 94.30 95.00 1 208009
## scene am tired 1.85 71.49 75.00 1 157702
## ifitellyou one thing 1.63 42.98 50.00 1 94818
## at webbinno on 1.16 14.48 25.00 1 31935
## also tracy needed 1.00 10.00 21.07 1 22059
## that and i 0.70 5.00 15.73 2 11029
## is at the 0.00 1.00 7.38 4 2206
## i think we -1.00 0.10 2.28 14 221
## would love to -2.00 0.01 0.61 48 22
## i have a -2.34 0.00 0.36 60 10
## a lot of -2.39 0.00 0.34 63 9
## thank you for -2.44 0.00 0.31 67 8
## i need to -2.50 0.00 0.29 68 7
## can^t wait to -2.57 0.00 0.26 69 6
## going to be -2.64 0.00 0.23 74 5
## for the follow -2.74 0.00 0.20 82 4
## looking forward to -2.87 0.00 0.17 90 3
## i love you -3.04 0.00 0.13 91 2
## thanks for the -3.34 0.00 0.10 248 1
## [1] "Twitter-Dev Ngram types: 223002 , Total Ngrams: 228572 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 2.9 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zz^s u no how 2.00 100.00 100.00 1 223002
## wines for the kids 1.98 94.87 95.00 1 211573
## stop playing the game 1.87 74.38 75.00 1 165859
## liking this author new 1.69 48.75 50.00 1 108716
## east coast devastation hope 1.36 23.13 25.00 1 51573
## at pm tune in 1.00 10.00 12.19 1 22300
## all cumfied out for 0.70 5.00 7.31 1 11150
## look at the banana 0.00 1.00 2.88 2 2230
## day to all the -1.00 0.10 0.76 4 223
## hope to see you -2.01 0.01 0.20 12 22
## for the first time -2.35 0.00 0.13 18 10
## have a great day -2.39 0.00 0.12 20 9
## at the same time -2.45 0.00 0.11 20 8
## can^t wait to see -2.50 0.00 0.10 22 7
## thank you so much -2.57 0.00 0.09 23 6
## the rest of the -2.65 0.00 0.08 24 5
## thank you for the -2.75 0.00 0.07 25 4
## is going to be -2.87 0.00 0.06 25 3
## thanks for the rt -3.05 0.00 0.05 42 2
## thanks for the follow -3.35 0.00 0.03 70 1
## [1] "Twitter-Dev Ngram types: 205635 , Total Ngrams: 206675 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.5 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zz^s u no how i 2.00 100.00 100.00 1 205635
## with a comfortable minutes to 1.98 94.97 95.00 1 195301
## tab and htc evo shift 1.87 74.87 75.00 1 153966
## love to the beastie boys 1.70 49.75 50.00 1 102297
## final page of the book 1.39 24.62 25.00 1 50629
## bakker nor any other minister 1.00 10.00 10.45 1 20564
## an exciting day kindras sencys 0.70 5.00 5.48 1 10282
## a flight with coach and 0.00 1.00 1.50 1 2056
## best one read am on -1.00 0.10 0.31 2 206
## at the end of the -1.99 0.01 0.06 4 21
## thanks for the shout out -2.31 0.00 0.03 6 10
## thank you for the rt -2.36 0.00 0.03 6 9
## let me know when you -2.41 0.00 0.03 6 8
## it^s going to be a -2.47 0.00 0.02 6 7
## for a chance to win -2.53 0.00 0.02 6 6
## can^t wait to see you -2.61 0.00 0.02 6 5
## buzz buzz buzz buzz buzz -2.71 0.00 0.01 6 4
## what do you think of -2.84 0.00 0.01 7 3
## in the middle of the -3.01 0.00 0.01 7 2
## keep up the good work -3.31 0.00 0.00 10 1
## [1] "Twitter-Dev Ngram types: 185630 , Total Ngrams: 185928 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.2 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zz^s u no how i play 2.00 100.00 100.00 1 185630
## with a new owner your new 1.98 94.99 95.00 1 176334
## taking all of my notes with 1.87 74.96 75.00 1 139148
## lunch have been in the studio 1.70 49.92 50.00 1 92666
## flowing without no stopping he^s sweeter 1.40 24.88 25.00 1 46184
## bava is now an obscure overlord 1.00 10.00 10.14 1 18563
## and argued w everyone grabbed a 0.70 5.00 5.15 1 9282
## a great weekend and a great 0.00 1.00 1.16 1 1856
## s o to all my followers -1.00 0.10 0.23 2 186
## hate people i hate people i -1.99 0.01 0.03 3 19
## and to washington dc for a -2.27 0.01 0.02 3 10
## amazing kisses i^m proud of be -2.31 0.00 0.02 3 9
## all you have to do is -2.37 0.00 0.02 3 8
## tyrantaylor sign them tyrantaylor sign them -2.42 0.00 0.02 4 7
## thanks for the follow hope you -2.49 0.00 0.01 4 6
## i hate people i hate people -2.57 0.00 0.01 4 5
## during class don^t cry during class -2.67 0.00 0.01 4 4
## don^t cry during class don^t cry -2.79 0.00 0.01 4 3
## cry during class don^t cry during -2.97 0.00 0.00 4 2
## buzz buzz buzz buzz buzz buzz -3.27 0.00 0.00 5 1
## [1] "Twitter-Dev Ngram types: 166226 , Total Ngrams: 166389 \n"
## [2] "Vocabulary Ngrams Sorted by Frequency"
## [3] "Top 1 % of ngram types account for 1.1 % total ngrams.\n"
## [4] "Many common ngrams contain 'Stopwords'."
## log10.col2 cum.%.vocab cum.%.corpus freq rank
## zz^s u no how i play everyman 2.00 100.00 100.00 1 166226
## with a tallboy pbr or schlitz on 1.98 95.00 95.00 1 157907
## takes someone special to be a dad 1.87 74.98 75.00 1 124629
## macaroni in my bra tonight thanks baby 1.70 49.95 50.00 1 83031
## follow please please please it will probably 1.40 24.93 25.00 1 41434
## be a new back to the future 1.00 10.00 10.09 1 16623
## and become so jaded by life on 0.70 5.00 5.09 1 8311
## a hell of a series hey losing 0.00 1.00 1.10 1 1662
## -spread the word if you believe in -1.00 0.10 0.20 1 166
## me please i love you su much -1.99 0.01 0.03 3 17
## during class don^t cry during class don^t -2.22 0.01 0.02 3 10
## dc for a concert and meet-n-greet austintodc -2.27 0.01 0.02 3 9
## class don^t cry during class don^t cry -2.32 0.00 0.02 3 8
## change you re amazing kisses i^m proud -2.38 0.00 0.01 3 7
## can you follow me please i love -2.44 0.00 0.01 3 6
## and to washington dc for a concert -2.52 0.00 0.01 3 5
## amazing kisses i^m proud of be simpsonizer -2.62 0.00 0.01 3 4
## don^t cry during class don^t cry during -2.74 0.00 0.01 4 3
## cry during class don^t cry during class -2.92 0.00 0.00 4 2
## buzz buzz buzz buzz buzz buzz buzz -3.22 0.00 0.00 4 1