At this (early) stage of the project, we are only now trying to get a firm grasp on what we don’t know. We have been provided with three different corpora^[The lack of agreement on “corpuses” versus “corpora” in English is just one more example of the challenges in making our text prediction system.], each of which appears to provide more than sufficient information to build interesting prediction systems. Each corpus is rather deep (lots of material, ~35 million words each) and rich (lots of variety, ~300,000 unique words in each).
The bulk of the early work dealt with finding reasonable methods of parsing the corpora into discrete ‘words’ and ‘phrases’. For this project we have made simplistic decisions about what separates words (mostly white-space) and what separates phrases from each other (most punctuation other than commas). With these simple decisions in place, we have been able to craft a reasonable set of scripts to implement them. The result is a separate freestanding database for each corpus, each holding n-gram^[This work relies heavily on n-grams as a simplifying structure to hold our language information.] data plus a summary table of facts of interest.
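As a minimal sketch of those splitting decisions (illustrative only; these helper names are placeholders, not the project scripts), the logic amounts to breaking cleaned lines at periods and breaking phrases at white space:

```r
# Illustrative splitting rules: phrases end at sentence endings (normalized
# to periods by the cleaning step described below); words are separated by
# white space.
split_phrases <- function(line) {
  phrases <- trimws(unlist(strsplit(line, "\\.")))
  phrases[nzchar(phrases)]
}

split_words <- function(phrase) {
  words <- unlist(strsplit(phrase, "\\s+"))
  words[nzchar(words)]
}

split_phrases("the cat sat. the dog barked")
#> [1] "the cat sat"    "the dog barked"
split_words("the cat sat")
#> [1] "the" "cat" "sat"
```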
The statistics quoted in this report are from what we have learned by reviewing the resulting databases.
Wrangling the available corpora into directly usable form is the first challenge of this work, and it is necessary before we can even begin to review the many megabytes of input.
The approach used was to first work out the steps necessary to ‘clean’^[To clean: start with raw unicode text; strip out all control characters, special characters, and symbols; remove all punctuation other than apostrophes inside words and sentence endings (which are converted into simple periods); finally simplify all separations into a single white space each.] each line of text.
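A minimal sketch of that cleaning pass follows; the exact regular expressions and the case-folding choice are our illustration, not the project scripts themselves.

```r
# Hedged sketch of the cleaning steps described above; the real scripts
# may differ in the exact regular expressions used.
clean_line <- function(x) {
  x <- tolower(x)                              # (assumption) fold case
  x <- gsub("[?!]+", ".", x)                   # sentence endings -> simple periods
  x <- gsub("[^a-z0-9'. ]", " ", x)            # drop symbols, control chars, other punctuation
  x <- gsub("(?<![a-z])'|'(?![a-z])", "", x, perl = TRUE)  # keep only in-word apostrophes
  gsub("\\s+", " ", trimws(x))                 # simplify separations to single spaces
}

clean_line("Don't worry -- it's FINE!!  Really?")
#> [1] "don't worry it's fine. really."
```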
Once cleaned, each line is processed to extract n-grams. The n-grams are used to build associative tables counting how many times each n-gram is encountered in the corpus. At the end of processing, these tables are saved as databases for later use.
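The counting itself can be sketched as follows (a toy version; the real tables, at ~35 million words per corpus, need something more memory-conscious such as data.table or an on-disk store):

```r
# Toy n-gram counter: slide a window of n words, paste it into a key, tally.
count_ngrams <- function(words, n) {
  if (length(words) < n) return(table(character(0)))
  starts <- seq_len(length(words) - n + 1)
  keys <- vapply(starts,
                 function(i) paste(words[i:(i + n - 1)], collapse = " "),
                 character(1))
  table(keys)
}

counts <- count_ngrams(c("the", "cat", "sat", "on", "the", "mat"), 2)
counts["the cat"]
#> the cat
#>       1
```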
Note: A small number of words scattered throughout these files are not pure ASCII, e.g. ‘café’. However, the affected words are sufficiently limited that complete handling of unicode may not greatly affect predictions.
Still, there are many words that include apostrophes, so predictions may suffer if the cleaning mechanisms are too simplistic.
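For reference, words that are not pure ASCII (the count reported in the table below) can be flagged with a simple character-range test; this is our illustration rather than the exact check used:

```r
# Flag tokens containing any character outside the printable ASCII range.
is_pure_ascii <- function(words) !grepl("[^ -~]", words)

is_pure_ascii(c("cafe", "café", "don't"))
#> [1]  TRUE FALSE  TRUE
```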
We have been provided with three corpora^[The data provided are from the English US Corpus at HC Corpora, though this work is based on a class-specific download.].
* English US Blogs
* English US Newspapers
* English US Twitter
Each corpus is provided as a single file, where each line in the file is some extract from a relevant source. The overall size of each file is roughly similar (very roughly 35 million words each), but each file has characteristics that match its source. For example, the Twitter corpus has many more lines, but each line is shorter.
The table below shows some basic information gathered while performing the processing steps; a sketch of how the ‘Unique’ and ‘Repeated’ rows can be derived follows the table.
Summary values from the three corpora.
| Label | Blogs | News | Twitter |
|---|---|---|---|
| Lines Parsed | 1010245 | 899288 | 2360150 |
| Phrases Parsed | 2414112 | 2643061 | 4351072 |
| Words Parsed | 36419984 | 39865815 | 34134638 |
| Unique Words | 249503 | 291339 | 340336 |
| Repeated Words | 146878 | 151583 | 145460 |
| Words not pure ASCII | 1233 | 1505 | 334 |
| Unique 2-grams | 5882190 | 5905363 | 4770121 |
| Repeated 2-grams | 1903711 | 1806824 | 1397670 |
| Unique 3-grams | 17637407 | 18247682 | 13305171 |
| Repeated 3-grams | 3182370 | 3104092 | 2348703 |
| Unique 4-grams | 25912597 | 27753673 | 19250563 |
| Repeated 4-grams | 2471526 | 2359766 | 2029436 |
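Assuming the per-corpus count tables described earlier, the ‘Unique’ and ‘Repeated’ rows are straightforward to derive (illustrative code, not the reporting script itself):

```r
# 'Unique' = distinct entries seen at all; 'Repeated' = distinct entries
# seen more than once. 'counts' is a named vector of n-gram counts.
summarise_counts <- function(counts) {
  c(unique = length(counts), repeated = sum(counts > 1))
}

summarise_counts(c("the cat" = 3, "cat sat" = 1, "on the" = 2))
#>   unique repeated
#>        3        2
```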
These files contain over 100 million words in total, but the distributions are very skewed. A relatively small number of words show up with high frequency: for each corpus, the occurrences of the 100 most common words sum to about half of all words found. On the other hand, nearly half of the unique words found in each corpus appear only once.
This skewing gets even stronger when looking at the n-grams. Only about a third of the 2-grams show up more than once. This fraction drops severely for the 3-grams, and for the 4-grams only about 1-in-10 shows up more than once.
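Both skew measures quoted above can be computed directly from the count tables; a minimal sketch, assuming `counts` is a vector of word or n-gram counts for one corpus:

```r
# Share of all tokens covered by the k most common entries.
coverage_top_k <- function(counts, k = 100) {
  sum(head(sort(counts, decreasing = TRUE), k)) / sum(counts)
}

# Share of distinct entries that were seen exactly once.
singleton_share <- function(counts) mean(counts == 1)
```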
These plots attempt to show the frequency distributions. For each plot, each n-gram has been sorted into buckets based on the number of times that n-gram was found in the corpus. To accommodate the long-tail behavior, the buckets have been defined based on powers of two, so the bucket limits are 1, 2, 4, 8, 16, and so on up to 2^16 in this case. The count of n-grams that fall into each bucket is plotted on the Y-axis. Note that, to accommodate the hard skewing, the axes on these charts are scaled by a sqrt() transform (distances further from the origin are compressed).
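A sketch of that bucketing and plotting, using ggplot2; the exact code behind the plots in this report may differ:

```r
library(ggplot2)

# Bin each n-gram by floor(log2(count)) so the bucket limits run
# 1, 2, 4, ... 2^16, then plot bucket sizes with sqrt-compressed axes.
bucket_table <- function(counts) {
  df <- as.data.frame(table(bucket = 2 ^ floor(log2(counts))),
                      stringsAsFactors = FALSE)
  df$bucket <- as.numeric(df$bucket)
  df
}

plot_buckets <- function(counts) {
  ggplot(bucket_table(counts), aes(x = bucket, y = Freq)) +
    geom_col() +
    scale_x_sqrt() +
    scale_y_sqrt()
}
```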
This suggests that we may be able to keep memory usage low by being selective about how we handle the rare cases. It does not seem likely that values seen only once in training will be of much help predicting real values. We expect to perform a number of training experiments to determine the right memory versus prediction trade-offs.
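The simplest form of that selectivity would be a count threshold; the value below is a placeholder, one of the settings such experiments would need to tune:

```r
# Drop n-grams whose training count falls below a threshold; with most
# 3-grams and 4-grams seen only once, even min_count = 2 shrinks the
# tables sharply.
prune_ngrams <- function(counts, min_count = 2) counts[counts >= min_count]
```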
A potentially more interesting issue is that, while the basic information for each corpus may be roughly similar, there are notable differences between the corpora when it comes to where words fall in relative frequency. It is not surprising that Twitter’s short messages have certain words that are very common in tweets even if they are uncommon elsewhere, but there are also common words whose relative frequency differs by orders of magnitude between the News corpus and the Blogs corpus.
The plot below shows selected words and where they show up in relative frequency across each of the three corpora. Note the use of a log10() scale on the Y-axis.
The words shown here were chosen effectively at random. From each corpus we took the words ranked at powers of 2 in that corpus’s frequency ordering (2, 4, 8, 16, 32, …), and we also looked up where those same words ranked in the other two corpora. It is not surprising that ‘im’ is found three orders of magnitude more often in the Twitter corpus than in the others. However, several other words also move by an order of magnitude or more; note the sharp bends in the lines for ‘my’, ‘were’, ‘area’, ‘ended’ and others.
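A rough sketch of how such a comparison can be plotted (simplified to rank the words in a single corpus; `blog_counts`, `news_counts` and `twitter_counts` are assumed named count vectors, not the project’s actual objects):

```r
library(ggplot2)

# Relative frequency of selected words within one corpus's count vector.
rel_freq <- function(counts, words) counts[words] / sum(counts)

compare_words <- function(blog_counts, news_counts, twitter_counts) {
  ranked <- names(sort(blog_counts, decreasing = TRUE))
  words  <- ranked[2 ^ (1:5)]                      # ranks 2, 4, 8, 16, 32
  df <- data.frame(
    word   = rep(words, times = 3),
    corpus = rep(c("Blogs", "News", "Twitter"), each = length(words)),
    freq   = c(rel_freq(blog_counts, words),
               rel_freq(news_counts, words),
               rel_freq(twitter_counts, words))
  )
  ggplot(df, aes(x = corpus, y = freq, group = word, colour = word)) +
    geom_line() +
    geom_point() +
    scale_y_log10()
}
```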
Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.
We are still a long way from having a working prediction system. However, it is clear that these corpora provide us with plenty to work with. The obvious concern will be managing the memory versus prediction performance trade-off, and it is quite likely that significant time will be spent evaluating criteria that can be used to trim unnecessary entries from the n-gram tables. If we can be effective within reasonable constraints, we may make an input set available from each provided corpus. This would allow the user to select which dataset to use as the basis for predicting their writing – enabling the user to see how the choice of inputs leads to differing application behavior.
This report was produced in RStudio using the ‘Tufte Handout’ template. If there are problems viewing any of this content, please check that the package’s introduction and demo content displays properly in your environment.