Original “raw” Corpora
Some basic statistics about the raw Corpora
twitters | 2360148 | 213 | 2 |
news | 1010242 | 11384 | 1 |
blogs | 899288 | 40835 | 1 |
Sampled Corpora
For the creation of the language model, 60% of the original corpora was used: six different random samples were generated, each 10% of the original corpora. The seeds used for random sampling of the data were c(19711004, 19760126, 19411016, 19430604, 19710425, 20020126).
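The sampling step can be sketched in R as follows. This is a minimal sketch, not the code used for the report: the helper name `sample_lines`, the toy `corpus` vector, and the per-line coin flip with `rbinom` are all assumptions. The coin flip is one plausible reading of the sample sizes reported here, which vary slightly around 10% rather than being identical.

```r
# Seeds listed in the report, one per sample.
seeds <- c(19711004, 19760126, 19411016, 19430604, 19710425, 20020126)

# Hypothetical helper: keep each line with probability `fraction`.
# Assumption: the report may have drawn its samples differently.
sample_lines <- function(lines, seed, fraction = 0.10) {
  set.seed(seed)                                   # reproducible draw
  keep <- rbinom(length(lines), 1, fraction) == 1  # one coin flip per line
  lines[keep]
}

# Toy stand-in for a corpus; a real run would read the raw file with readLines().
corpus <- paste("line", 1:1000)

# One sample per seed; sizes land near 10% of the corpus but are not identical.
samples <- lapply(seeds, function(s) sample_lines(corpus, s))
sapply(samples, length)
```

Fixing the seed before each draw means that rerunning the script reproduces exactly the same six samples.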
Some Info about the Samples
Some basic statistics about Sample #1
twitters | 236217 | 159 | 3 | 228117 |
news | 100873 | 2428 | 1 | 145430 |
blogs | 89864 | 37241 | 1 | 178650 |
Some basic statistics about Sample #2
twitters | 236195 | 162 | 3 | 228140 |
news | 101157 | 11384 | 1 | 146716 |
blogs | 90098 | 37241 | 1 | 179971 |
Some basic statistics about Sample #3
twitters | 236260 | 180 | 2 | 228403 |
news | 100741 | 5382 | 1 | 144736 |
blogs | 89717 | 7034 | 1 | 177999 |
Some basic statistics about Sample #4
twitters | 235624 | 161 | 2 | 228260 |
news | 100456 | 5140 | 1 | 144187 |
blogs | 89428 | 9189 | 1 | 176270 |
Some basic statistics about Sample #5
twitters | 235740 | 213 | 3 | 227686 |
news | 101129 | 5140 | 1 | 145898 |
blogs | 89939 | 5971 | 2 | 178017 |
Some basic statistics about Sample #6
twitters | 236013 | 160 | 3 | 228291 |
news | 100972 | 8949 | 1 | 144988 |
blogs | 89910 | 14213 | 1 | 178445 |
A more detailed view of the Corpora (from Sample#1)
More information about the unigrams, bigrams and trigrams of one of the samples, broken down by corpus: twitter, news and blogs.
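As an illustration of what counting n-grams involves, here is a minimal base-R sketch. The helpers `tokenize` and `ngrams` are hypothetical, and the tokenization rule (lowercase, split on anything that is not a letter or apostrophe) is an assumption. The report does not say which tokenizer or package was actually used.

```r
# Hypothetical helper: split a line into lowercase word tokens.
tokenize <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  words[nzchar(words)]                 # drop empty strings left by the split
}

# Hypothetical helper: all n-grams of one tokenized line, joined by spaces.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input; n-grams are built per line, so they never span two lines.
lines <- c("the cat sat on the mat", "the cat ate")
bigrams <- unlist(lapply(lines, function(l) ngrams(tokenize(l), 2)))

# Frequency table, most frequent bigram first.
freq <- sort(table(bigrams), decreasing = TRUE)
```

On this toy input there are 7 bigram tokens and 6 distinct bigrams, with "the cat" occurring twice.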
News
1-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

2-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
N | V | 50% coverage | 90% coverage |
3011843 | 1017258 | 34222 | 716074 |

3-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
N | V | 50% coverage | 90% coverage |
2866413 | 2132289 | 699083 | 1845648 |

Blogs
1-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

2-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
N | V | 50% coverage | 90% coverage |
3439404 | 1012269 | 20278 | 668329 |

3-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
N | V | 50% coverage | 90% coverage |
3260754 | 2268375 | 637998 | 1942300 |

Corpora (all samples)
More information about the unigrams, bigrams and trigrams across all of the samples. Note how both the number of tokens (N) and the size of the vocabulary (V) increase.
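The coverage question asked for each n-gram order can be answered from a frequency table alone. The sketch below is a minimal illustration rather than the report's own code: the helper `coverage_size` and the toy `counts` vector are hypothetical. It sorts the counts in decreasing order and returns how many of the most frequent entries are needed before their cumulative share of all instances reaches the target.

```r
# Hypothetical helper: number of most frequent types needed to cover
# at least `share` of all token instances.
coverage_size <- function(counts, share) {
  sorted <- sort(as.numeric(counts), decreasing = TRUE)
  cumulative <- cumsum(sorted) / sum(sorted)   # running share of instances
  which(cumulative >= share)[1]                # first rank reaching the target
}

# Toy frequency table (made-up counts, for illustration only).
counts <- c(the = 50, of = 20, cat = 10, mat = 10, sat = 5, ate = 5)

N <- sum(counts)       # number of tokens
V <- length(counts)    # vocabulary size (distinct types)

cover50 <- coverage_size(counts, 0.5)
cover90 <- coverage_size(counts, 0.9)
```

With these toy counts, N is 100 and V is 6. One entry is enough for 50% coverage and four are needed for 90%.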
1-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

2-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
N | V | 50% coverage | 90% coverage |
53549688 | 6905615 | 28129 | 2386330 |

3-grams


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
N | V | 50% coverage | 90% coverage |
50239484 | 20496858 | 1924741 | 15472910 |
