Ian Donaldson November 16th, 2014
This document takes an initial look at the data provided for the Coursera capstone project of the Data Science specialization.
The project task is to develop a Shiny app that allows a user to enter a few words and then predicts the word that the user is most likely to enter next. The three training texts provided were collected from blogs, Twitter feeds and news feeds. The corpus is further described here.
The table below lists the number of lines, words and characters in each of the training texts.
| File name | Lines | Words | Characters | File size |
|---|---|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,690 | 208,623,081 | 200 MB |
| en_US.news.txt | 1,010,242 | 34,372,720 | 205,243,643 | 196 MB |
| en_US.twitter.txt | 2,360,148 | 30,374,206 | 166,816,544 | 159 MB |
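For reference, counts like these could be gathered with a short R script along the following lines. This is a minimal sketch, not the code actually used; it assumes the three files sit in the working directory and are small enough to read fully into memory.

```r
# Sketch: line, word and character counts for each training file.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(lines      = length(lines),
    words      = sum(lengths(strsplit(lines, "\\s+"))),
    characters = sum(nchar(lines)))
}))
stats
```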
A sample of six lines from the blogs data set looks like this:
1. Sickos
2. In addition to the death benefit, living benefits life insurance policies offer the trifecta of terminal, chronic and critical illness financial protection for insureds and their families. Used as an alternative to traditional long term care insurance, indexed universal life insurance products with living benefits can provide dollars for chronic illness with the added benefit of cash value buildup. Clients who balk at paying a hefty LTC insurance premium may be more than willing to purchase a living benefits index universal life policy as it can provide significant cash value if the policy is surrendered.
3. FEMALE CALLER (31:50): He (Michael Jackson) is truly the soundtrack of my life. I also have a theory about Sarah Palin as well and I’m going to put it out there on radio, hopefully someone can investigate.
4. But the thoughts are trickling back, and I know that the more I write, the more they'll start flooding back, and so relief is in sight there, too.
5. Maybe it just sags
6. A Culpepper town spokesperson said “The officer’s cruiser did have a video camera, but that it was not working.” For the second time this week I’m going to have to call bullshit. If my car doesn’t have everything working properly I get a moving violation and the police are all over it. I find it hard to beleive that a municipal police force would willingly and knowingly allow a cruiser on the streets with a broken video camera.
Hundreds of sentences from the blogs data set and the other two data sets were examined by hand to explore the range of issues that would need to be addressed in order to pre-process the text successfully.
Three major issues were identified. First, each line in the training corpus could consist of a single sentence, multiple sentences, or sentence fragments in any combination. Clearly, it would be important to identify sentence boundaries before attempting to extract n-grams (a series of n contiguous words where n can be 2, 3, 4, etc.) because informative n-grams cannot cross sentence boundaries. Second, punctuation use was highly inconsistent, especially in the Twitter corpus. The decision was made to remove most punctuation in a first pass and then use test results to determine which forms of punctuation were most important to preserve in order to capture the most meaningful n-grams. Third, there were multiple token classes (such as prices: $1.00, times: 10:23, email addresses: me@apple.com, and integers and real numbers: 1.023) that could be readily recognized and converted to standardized tokens like _price_, _time_, _email_ and _number_. I hypothesized that doing so might help match user inputs containing these token types to their closest counterparts in the training text.
A number of ad hoc regular expressions were written to process the text and deal with these issues; the strategy was to do this quickly without getting distracted by every single exception. Testing in later stages would identify the issues that had the greatest impact on performance and that should be dealt with further.
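A simplified sketch of this pre-processing pass is shown below. The patterns are illustrative stand-ins for the ad hoc expressions actually used, and the placeholder tokens follow the conventions visible in the processed examples that follow.

```r
# Sketch of the pre-processing pass: lower-case the line, replace
# recognizable token classes with placeholder tokens, mark sentence
# boundaries and strip the remaining punctuation.
preprocess_line <- function(line) {
  s <- tolower(line)
  s <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+", " _email_ ", s)   # email addresses
  s <- gsub("\\$[0-9]+(\\.[0-9]+)?", " _price_ ", s)              # prices such as $1.00
  s <- gsub("\\b[0-9]{1,2}:[0-9]{2}\\b", " _time_ ", s)           # times such as 10:23
  s <- gsub("\\b[0-9]+(\\.[0-9]+)?\\b", " _number_ ", s)          # integers and real numbers
  s <- gsub("[.!?]+", " _period_ _sentencestart_ ", s)            # sentence boundaries
  s <- gsub("[^[:alnum:]_' ]", " ", s)                            # drop other punctuation
  s <- gsub("\\s+", " ", s)                                       # collapse whitespace
  paste("_linestart_", trimws(s), "_lineend_")
}

preprocess_line("Maybe it just sags")
## [1] "_linestart_ maybe it just sags _lineend_"
```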
The six sample lines above are shown below after pre-processing.
1. _linestart_ sickos _lineend_
2. _linestart_ in addition to the death benefit living benefits life insurance policies offer the trifecta of terminal chronic and critical illness financial protection for insureds and their families _period_ _sentencestart_ used as an alternative to traditional long term care insurance indexed universal life insurance products with living benefits can provide dollars for chronic illness with the added benefit of cash value buildup _period_ _sentencestart_ clients who balk at paying a hefty ltc insurance premium may be more than willing to purchase a living benefits index universal life policy as it can provide significant cash value if the policy is surrendered _period_
3. _linestart_ female caller 31 50 he michael jackson is truly the soundtrack of my life _period_ _sentencestart_ i also have a theory about sarah palin as well and i m going to put it out there on radio hopefully someone can investigate _period_
4. _linestart_ but the thoughts are trickling back and i know that the more i write the more they'll start flooding back and so relief is in sight there too _period_
5. _linestart_ maybe it just sags _lineend_
6. _linestart_ a culpepper town spokesperson said the officer s cruiser did have a video camera but that it was not working for the second time this week i m going to have to call bullshit _period_ _sentencestart_ if my car doesn t have everything working properly i get a moving violation and the police are all over it _period_ _sentencestart_ i find it hard to beleive that a municipal police force would willingly and knowingly allow a cruiser on the streets with a broken video camera _period_
An index was built from a 0.5% sample of the training data. The index consisted of two lists. The first list held an integer representing each position in the text and the token that occurred at that position (tokens are space-separated). The second list held all of the distinct tokens and the positions in the training corpus at which each token occurred. Samples of these two lists are shown below.
A list of positions and the tokens occurring at those positions:
1 : _linestart_
2 : sickos
3 : _lineend_
4 : _linestart_
5 : in
6 : addition
7 : to
...
A list of all distinct tokens and the positions at which they occur:
addition : 6 21207 41202 45092 51582 ...
benefit : 10 61 5673 75323 97777 ...
death : 9 8155 9633 12767 17064 ...
sickos : 2
trifecta : 18
...
So, given a token like sickos, it is possible to tell from the second list that it occurs once in the training data, at position 2, and from the first list that the next token is a _lineend_. This word could never be part of anything longer than a one-word sequence (unigram) because it occurs next to a sentence/line boundary that is not crossed when looking for a series of connected words. In contrast, the words “in addition” at positions 5 and 6 form a bigram that can possibly be extended further to a trigram (by adding “to”) or a longer n-gram.
In principle, this method of representing the data allows one to calculate all possible n-grams or to find the set of words that follow a given n-gram. The method also uses no more memory than roughly twice the size of the training corpus. The lists are represented in R using the hash package, so looking up values in these tables is faster than looking them up in R’s conventional list or matrix structures.
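A minimal sketch of how such an index could be built and queried with the hash package is shown below. The names build_index and following_tokens are illustrative rather than the functions actually used, and preprocessed_text is assumed to hold the pre-processed sample as a single space-separated string.

```r
# Sketch of the two-list index using the 'hash' package.
library(hash)

build_index <- function(tokens) {
  # first list: position -> token (hash keys must be character strings)
  pos_to_token <- hash(as.character(seq_along(tokens)), tokens)
  # second list: token -> all positions at which it occurs
  positions    <- split(seq_along(tokens), tokens)
  token_to_pos <- hash()
  for (tok in names(positions)) token_to_pos[[tok]] <- positions[[tok]]
  list(pos_to_token = pos_to_token, token_to_pos = token_to_pos)
}

# Tokens observed immediately after a given token: the basic operation
# needed to extend an n-gram by one word.
following_tokens <- function(index, token) {
  pos <- index$token_to_pos[[token]]
  pos <- pos[pos < length(index$pos_to_token)]   # guard the end of the corpus
  vapply(pos, function(p) index$pos_to_token[[as.character(p + 1)]], character(1))
}

idx <- build_index(unlist(strsplit(preprocessed_text, " ")))
following_tokens(idx, "in")   # e.g. "addition", ...
```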
An index of the 0.5% sample was constructed in about 2.1 minutes. This index was used to determine that there are 19,422 distinct unigrams in the sample.
The table below shows the frequency of the top 10 tokens (unigrams) and the percentage and cumulative percentage of the training corpus text that they compose.
## token freq percent cumulative
## 1 the 9565 4.454265 4.454265
## 2 _period_ 9242 4.303849 8.758115
## 3 _sentencestart_ 6867 3.197850 11.955965
## 4 and 5495 2.558932 14.514897
## 5 to 5429 2.528197 17.043094
## 6 _linestart_ 4496 2.093714 19.136809
## 7 a 4432 2.063910 21.200719
## 8 i 4349 2.025259 23.225978
## 9 of 4322 2.012685 25.238663
## 10 in 3124 1.454796 26.693459
Examination of the unigrams showed that 4,605 unigrams (2.14%) accounted for 90% of the training corpus text. The majority of the unigrams (10,106) were seen only once. A plot of the percentage of the training corpus represented by each token (as in the table above) showed that the 50 most common tokens account for the majority of the text.
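The frequencies in the table above can be derived directly from the second list of the index, since the frequency of a token is simply the number of positions recorded for it. A sketch, reusing the hypothetical idx built in the earlier example:

```r
# Sketch: unigram frequency table from the token -> positions list.
unigram_table <- function(index) {
  toks <- keys(index$token_to_pos)
  freq <- vapply(toks, function(k) length(index$token_to_pos[[k]]), integer(1))
  freq <- sort(freq, decreasing = TRUE)
  pct  <- 100 * freq / sum(freq)
  data.frame(token = names(freq), freq = freq, percent = pct,
             cumulative = cumsum(pct), row.names = NULL)
}

head(unigram_table(idx), 10)   # reproduces a table like the one above
```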
The index was used to generate all possible n-grams consisting of two to six words (bigrams to hexagrams). This was done by attempting to extend all unigrams by one word to obtain a set of bigrams. Bigrams were then extended to obtain trigrams and so on. The table below summarizes findings for these generated data.
The time required to create bigrams (or any other n-gram type) varied between 2 and 7 minutes (column “genTime”). The number of distinct n-grams (column “count”) was lowest for unigrams (19,422) and peaked for trigrams (153,165). The number of distinct n-grams with more than one occurrence dropped steadily as the gram size increased, until there were only 12 hexagrams with more than one occurrence (column “nfgt1”). Finally, the percentage of n-grams required to account for 90% of the text (column “pt90”) was about 2% for unigrams and 47% for bigrams but was essentially 90% for all larger n-grams.
These data indicate that anything larger than a trigram is basically just repeating the training text and is not likely to be useful for predicting (with any reliability) a word that follows an input phrase.
## gramType genTime count nfgt1 pt90
## [1,] "unigram" "2.101689 mins" "19422" "9316" "2.1"
## [2,] "bigram" "2.286392 mins" "102148" "18270" "46.8"
## [3,] "trigram" "4.084025 mins" "153165" "4402" "85.3"
## [4,] "quadragram" "5.697091 mins" "146454" "158" "89.9"
## [5,] "pentagram" "6.451123 mins" "136367" "18" "90"
## [6,] "hexagram" "7.137874 mins" "126826" "12" "90"
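The extension step described above can be expressed in terms of the index: an n-gram is identified by the positions at which it starts, and extending it by one word means looking up the token that follows each occurrence and discarding continuations that cross a sentence or line boundary. A rough sketch, reusing the hypothetical idx from earlier; the set of boundary tokens and the exact bookkeeping used in the report may differ.

```r
# Sketch of extending an n-gram by one word.  The boundary token list
# below is an assumption about which placeholders end an n-gram.
boundary_tokens <- c("_lineend_", "_linestart_", "_period_", "_sentencestart_")

extend_ngram <- function(index, start_positions, n) {
  cand <- start_positions + n                        # position of the candidate next word
  cand <- cand[cand <= length(index$pos_to_token)]
  nxt  <- vapply(cand, function(p) index$pos_to_token[[as.character(p)]], character(1))
  keep <- !(nxt %in% boundary_tokens)
  data.frame(start = cand[keep] - n, next_word = nxt[keep])
}

# e.g. the unigram "in" (n = 1) extends to bigrams such as "in addition"
extend_ngram(idx, idx$token_to_pos[["in"]], 1)
```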
These data were generated from a small sample of the training corpus. Larger samples will be tried to assess how many new, informative n-grams can be gained by training on more data. It is expected that a point of diminishing returns will be reached. In addition, each of the other two training corpora will be examined with this method.
Next steps will involve writing an algorithm that predicts a word given an input phrase. At the same time, a method for measuring prediction performance will be devised. These new algorithms and the indexing method above will be used in an iterative process of finding areas that are problematic for the model and correcting them, either by improving the prediction algorithm or by revising the pre-processing of the training text.