Original “raw” Corpora

Some basic statistics about the raw corpora
sources   noOfEntries  maxNoOfChar  minNoOfChar
twitters      2360148          213            2
news          1010242        11384            1
blogs          899288        40835            1
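
As an illustration of how statistics like the ones above can be derived, here is a minimal Python sketch. It is not the code used for this report: the function name corpus_stats and the toy entries are hypothetical, and it assumes each corpus is available as a list of strings, one entry per line.

```python
def corpus_stats(lines):
    """Return the entry count and the longest/shortest entry length in characters."""
    lengths = [len(line) for line in lines]
    return {
        "noOfEntries": len(lengths),
        "maxNoOfChar": max(lengths, default=0),  # default avoids an error on an empty corpus
        "minNoOfChar": min(lengths, default=0),
    }


# Made-up entries for illustration only (not taken from the real corpora).
toy_corpus = ["hi", "a longer entry here", "ok then"]
stats = corpus_stats(toy_corpus)
print(stats)
```

On the real data the same function would be applied once per source (twitter, news, blogs) after reading the file line by line.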

Sampled Corpora

For the creation of the language model, 60% of the original corpora has been used. Six different random samples, each 10% of the original corpora, have been generated (seeds used for the random sampling of the data: c(19711004, 19760126, 19411016, 19430604, 19710425, 20020126)).
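
The sampling step can be sketched as follows. This is a Python illustration under stated assumptions, not the original code: it assumes every entry is kept independently with probability 0.10 (a per-entry coin flip), which would be consistent with the slightly different entry counts per sample reported in this document, and it assumes the six samples are drawn independently of each other, so they may overlap. The function name bernoulli_sample and the toy corpus are hypothetical. Python's random number generator differs from R's, so these seeds will not reproduce the samples used here.

```python
import random

# Seeds quoted in the text above.
SEEDS = [19711004, 19760126, 19411016, 19430604, 19710425, 20020126]


def bernoulli_sample(lines, prob, seed):
    """Keep each entry independently with probability `prob`, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    return [line for line in lines if rng.random() < prob]


# Made-up corpus for illustration only.
toy_corpus = [f"entry {i}" for i in range(10000)]
samples = [bernoulli_sample(toy_corpus, 0.10, seed) for seed in SEEDS]

for seed, sample in zip(SEEDS, samples):
    print(seed, len(sample))
```

Fixing the seed per sample makes each draw repeatable, which is what allows the statistics for an individual sample to be recomputed later.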

Some Info about the Samples

Some basic statistics about Sample #1
sources   noOfEntries  maxNoOfChar  minNoOfChar  noOfSentences
twitters       236217          159            3         228117
news           100873         2428            1         145430
blogs           89864        37241            1         178650
Some basic statistics about Sample #2
sources   noOfEntries  maxNoOfChar  minNoOfChar  noOfSentences
twitters       236195          162            3         228140
news           101157        11384            1         146716
blogs           90098        37241            1         179971
Some basic statistics about Sample #3
sources   noOfEntries  maxNoOfChar  minNoOfChar  noOfSentences
twitters       236260          180            2         228403
news           100741         5382            1         144736
blogs           89717         7034            1         177999
Some basic statistics about Sample #4
sources   noOfEntries  maxNoOfChar  minNoOfChar  noOfSentences
twitters       235624          161            2         228260
news           100456         5140            1         144187
blogs           89428         9189            1         176270
Some basic statistics about Sample #5
sources   noOfEntries  maxNoOfChar  minNoOfChar  noOfSentences
twitters       235740          213            3         227686
news           101129         5140            1         145898
blogs           89939         5971            2         178017
Some basic statistics about Sample #6
sources   noOfEntries  maxNoOfChar  minNoOfChar  noOfSentences
twitters       236013          160            3         228291
news           100972         8949            1         144988
blogs           89910        14213            1         178445

A more detailed view of the Corpora (from Sample#1)

Some more information about unigrams, bigrams and trigrams for one of the samples, looking at each corpus separately: twitter, news and blogs.
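
To make the quantities N (number of tokens) and V (vocabulary size) concrete, here is a minimal Python sketch of n-gram counting. It is not the tokenizer used for this report: lowercasing and whitespace splitting are simplifying assumptions, and the names ngrams and count_ngrams are hypothetical. N-grams are built inside each sentence only, never across sentence boundaries. That choice is an assumption too, but it appears consistent with the Sample #1 figures in this document, where N drops from one n-gram order to the next by the number of sentences (exactly for news and blogs, within one for twitter).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams sentence by sentence, so no n-gram spans two sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: naive whitespace tokenization
        counts.update(ngrams(tokens, n))
    return counts


# Made-up sentences for illustration only.
sentences = ["The cat sat", "The cat ran away"]
bigrams = count_ngrams(sentences, 2)

N = sum(bigrams.values())  # number of bigram tokens
V = len(bigrams)           # number of distinct bigrams
print(N, V)
```

In this toy example the seven word tokens yield five bigram tokens, one fewer per sentence, which mirrors the relation between the 1-gram and 2-gram token counts reported per corpus.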

Twitter

1-grams

How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             2704169                78498            61          4125
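
The coverage columns can be read as: sort the dictionary by decreasing frequency and count how many entries are needed before their cumulative frequency reaches the target share of all tokens. A minimal Python sketch of that calculation follows. The function name coverage_size and the toy counts are hypothetical, and this is an illustration of the idea rather than the code that produced the figures in this report.

```python
from collections import Counter


def coverage_size(counts, target):
    """Smallest number of most-frequent types whose counts reach `target` of all tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank
    return len(counts)  # only reached for an empty dictionary


# Made-up frequencies for illustration only (12 tokens, 5 distinct words).
counts = Counter({"the": 6, "cat": 3, "sat": 1, "mat": 1, "on": 1})
print(coverage_size(counts, 0.5))
print(coverage_size(counts, 0.9))
```

The same function works unchanged for 2-grams and 3-grams, because it only needs a mapping from dictionary entry to frequency.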

2-grams

How many unique 2-grams do you need in a frequency-sorted dictionary to cover 50% of all 2-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             2476053               762320         15208        514715

3-grams

How many unique 3-grams do you need in a frequency-sorted dictionary to cover 50% of all 3-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             2247936              1554005        430037       1329212

News

1-grams

How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             3157273                81101           115          6806

2-grams

How many unique 2-grams do you need in a frequency-sorted dictionary to cover 50% of all 2-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             3011843              1017258         34222        716074

3-grams

How many unique 3-grams do you need in a frequency-sorted dictionary to cover 50% of all 3-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             2866413              2132289        699083       1845648

Blogs

1-grams

How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             3618054                83849            67          5250

2-grams

How many unique 2-grams do you need in a frequency-sorted dictionary to cover 50% of all 2-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             3439404              1012269         20278        668329

3-grams

How many unique 3-grams do you need in a frequency-sorted dictionary to cover 50% of all 3-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
             3260754              2268375        637998       1942300

Corpora (all samples)

Some more information about unigrams, bigrams and trigrams across all of the samples. Note how both the number of tokens (N) and the vocabulary size (V) increase.
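
One way to picture why N and V grow differently when samples are combined: token counts simply add up, while the vocabulary only grows by the entries that were not already seen. The Python sketch below illustrates this with two made-up frequency tables. It assumes the combined figures are obtained by merging per-sample counts, which the report does not state explicitly.

```python
from collections import Counter

# Made-up word frequencies for two hypothetical samples.
sample_a = Counter({"the": 4, "cat": 2, "sat": 1})
sample_b = Counter({"the": 3, "dog": 2, "sat": 1})

# Adding Counters sums the frequencies of shared entries and keeps the rest.
combined = sample_a + sample_b

N = sum(combined.values())  # token counts add up: 7 + 6
V = len(combined)           # vocabulary grows only by unseen entries
print(N, V)
```

Here N is the sum of the two sample sizes, while V is smaller than the sum of the two vocabularies because "the" and "sat" occur in both samples.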

1-grams

How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
            56859890               362734            81          6141

2-grams

How many unique 2-grams do you need in a frequency-sorted dictionary to cover 50% of all 2-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
            53549688              6905615         28129       2386330

3-grams

How many unique 3-grams do you need in a frequency-sorted dictionary to cover 50% of all 3-gram instances in the language? 90%?

N (number of tokens)  V (vocabulary size)  50% coverage  90% coverage
            50239484             20496858       1924741      15472910