The Capstone Project of the Coursera Data Science Specialization requires us to develop a text-prediction application similar to the product that SwiftKey offers. The objective of this first report is to summarize the exploratory analysis of the English corpus that we’ll use as a basis for prediction. We’ll also describe our plans for creating the prediction algorithm and the application.
The corpus has three sources that together contain over 100 million words (a sketch of how such counts can be computed follows the list):
A set of 899,288 blog entries containing an average of 42 words per entry
A set of 1,010,242 news lines containing an average of 34 words per line
A set of 2.3 million Twitter messages containing an average of 13 words per message
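For reference, counts like the ones above can be computed with a few lines of R. This is only a sketch: it assumes the standard file names of the SwiftKey dataset (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and uses the stringi package for word counting.

```r
# Sketch: line and word counts per source file (file names assumed, not verified here)
library(stringi)

count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- stri_count_words(lines)
  c(lines = length(lines), words = sum(words), words_per_line = round(mean(words), 1))
}

sapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), count_file)
```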
In order to have more sensible predictions, I decided to use the “sentence” as the basic unit of analysis (instead of using “lines”), since the elements of a sentence can be expected to have a certain coherence, whereas words in the same line are not necessarily linked. I found that, on average:
the blogs have 2.7 sentences per line,
the news have 2.1 sentences per line, and
Twitter has 1.6 sentences per line.
I considered a new sentence whenever I encountered a period, question mark, exclamation mark, or end of line. Prior to splitting sentences, I replaced all stand-alone numbers (including those with commas, signs, and decimal points) with a token, in order to keep track of the number’s position while avoiding confusing a decimal point with an end of sentence. I also replaced some frequent abbreviations (such as “p.m.” with “pm”.) The number of sentences per line for blogs and news seems low, so refining how sentences are defined is something that will warrant further examination.
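As an illustration, the splitting rules just described could be implemented along the following lines. This is a sketch only: the “&lt;num&gt;” token name, the abbreviation list, and the regular expressions are illustrative choices, not the exact ones used.

```r
# Sketch of the sentence-splitting rules described above (token name and regexes illustrative)
split_sentences <- function(lines) {
  # Replace a few frequent abbreviations so their periods do not end a sentence
  lines <- gsub("\\bp\\.m\\.", "pm", lines, ignore.case = TRUE)
  lines <- gsub("\\ba\\.m\\.", "am", lines, ignore.case = TRUE)
  # Replace stand-alone numbers (optional sign, commas, decimal point) with a token
  lines <- gsub("[+-]?\\d[\\d,]*(\\.\\d+)?", "<num>", lines, perl = TRUE)
  # Split at periods, question marks, and exclamation marks; each line end also ends a sentence
  sentences <- unlist(strsplit(lines, "[.?!]+"))
  trimws(sentences[nzchar(trimws(sentences))])
}

split_sentences("She ran 26.2 miles in 4 hours. Amazing, right? Yes!")
# "She ran <num> miles in <num> hours"  "Amazing, right"  "Yes"
```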
In order to analyze the word frequencies in the sentences, I took a random sample of the documents and applied the following criteria (a code sketch follows the list):
Converted the contents to lowercase in order to avoid counting words twice due to capitalization.
Removed all special characters except apostrophes (so that contractions and possessive nouns remain intact.)
Decided not to convert the sentences to “stem words” and not to remove “stop words”, because that would hurt the quality of our predictions. All words need to be counted exactly as they were written, so that we can predict the right word in the right place. Perhaps misspellings are a valid exception to this rule; we will discuss that later in this report.
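The code below illustrates these cleaning criteria (lowercasing, keeping apostrophes, dropping other special characters, no stemming or stop-word removal). The function names are illustrative, and a real pipeline would also need to preserve the number token introduced earlier.

```r
# Sketch of the cleaning and tokenization steps listed above (names are illustrative)
clean_sentence <- function(sentence) {
  s <- tolower(sentence)            # avoid counting the same word twice due to capitalization
  s <- gsub("[^a-z' ]", " ", s)     # keep letters, apostrophes, and spaces only
  gsub("\\s+", " ", trimws(s))      # collapse repeated whitespace
}

tokenize <- function(sentence) {
  unlist(strsplit(clean_sentence(sentence), " "))
}

tokenize("Don't worry; it's John's turn!")
# "don't" "worry" "it's" "john's" "turn"
```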
After these steps, I ended up with 4.6 million words containing about 125 thousand unique terms. Here is the distribution of word frequencies:
The graph shows a highly skewed distribution of words; put another way, there is a very large number of terms that appear with very low frequency. In fact, almost 50% of the terms appear only once in the sample! I found that many of those infrequent terms are the product of spelling mistakes, while others are valid but uncommon terms (such as “agronomist”.) In either case, we’ll ignore these terms in the final solution, given how little value they add relative to their cost in memory and prediction speed.
The graph shows bars only up to 40 occurrences per term because, as one can see, the number of terms drops significantly as we approach that point. At the other extreme of the range (not visible in the graph) we find a few words that appear with extremely high frequency, such as “the” (which occurred over 289 thousand times in the same sample) or “and” (over 149 thousand times.)
The following table shows the distribution of word frequencies (including cumulative percentages) at both extremes of the range (i.e., for the least frequent and most frequent terms):
## Frequency Unique Words Cumulative% Total Occurrences Cumulative%
## 1 60,585 48.8% 60,585 1.31%
## 2 16,016 61.7% 32,032 2.00%
## 3 8,193 68.3% 24,579 2.53%
## 4 5,040 72.4% 20,160 2.96%
## 5 3,684 75.3% 18,420 3.36%
## 6 2,745 77.5% 16,470 3.71%
## 7 2,176 79.3% 15,232 4.04%
## 8 1,803 80.7% 14,424 4.35%
## 9 1,458 81.9% 13,122 4.63%
## 10 1,271 82.9% 12,710 4.91%
## ...
## 44,293 1 99.99597% 44,293 86.0%
## 60,325 1 99.99678% 60,325 87.3%
## 63,371 1 99.99758% 63,371 88.6%
## 88,496 1 99.99839% 88,496 90.5%
## 149,770 1 99.99919% 149,770 93.8%
## 289,959 1 100.00000% 289,959 100.0%
## TOTAL 124,152 4,641,107
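For completeness, here is a sketch of how a frequency-of-frequencies table like the one above can be assembled. It assumes a character vector `words` holding every token in the sample (a hypothetical object name).

```r
# Sketch: frequency-of-frequencies table from a vector of tokens (`words` is assumed)
word_freq <- table(words)                       # occurrences of each unique term
freq_freq <- table(as.integer(word_freq))       # how many terms share each occurrence count

summary_tbl <- data.frame(
  Frequency   = as.integer(names(freq_freq)),
  UniqueWords = as.integer(freq_freq)
)
summary_tbl$TotalOccurrences <- summary_tbl$Frequency * summary_tbl$UniqueWords
summary_tbl$CumUniquePct     <- 100 * cumsum(summary_tbl$UniqueWords) / sum(summary_tbl$UniqueWords)
summary_tbl$CumOccurrPct     <- 100 * cumsum(summary_tbl$TotalOccurrences) / sum(summary_tbl$TotalOccurrences)
head(summary_tbl, 10)
```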
Next, we analyzed pairs of consecutive words (i.e., bigrams) within the same sentence. We found a total of 5.4 million word-pair occurrences in the sample, comprising 1.6 million unique pairs. Here is the distribution of frequencies for those pairs:
The graph shows that bigrams are also highly skewed. At the lower end of the range we have many pairs that appear only once (for example, rarities such as “zyrtec makes”), while at the other end we have a few pairs that appear a disproportionately large number of times, such as “of the” (27+ thousand times) or “in the” (24+ thousand times.)
As before, we also present this information in tabular form:
## Frequency Unique pairs Cumulative% Total Occurrences Cumulative%
## 1 1,199,823 74.6% 1,199,823 22.0%
## 2 172,689 85.3% 345,378 28.4%
## 3 68,787 89.6% 206,361 32.2%
## 4 37,234 91.9% 148,936 34.9%
## 5 23,832 93.4% 119,160 37.1%
## 6 16,547 94.4% 99,282 38.9%
## 7 12,155 95.2% 85,085 40.5%
## 8 9,127 95.7% 73,016 41.8%
## 9 7,389 96.2% 66,501 43.0%
## 10 5,842 96.6% 58,420 44.1%
## ...
## 9,269 1 99.99969% 9,269 98.39%
## 10,793 1 99.99975% 10,793 98.59%
## 11,663 1 99.99981% 11,663 98.81%
## 12,948 1 99.99988% 12,948 99.04%
## 24,882 1 99.99994% 24,882 99.50%
## 27,152 1 100.00000% 27,152 100.00%
## TOTAL 1,608,601 5,445,373
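Bigram counts like the ones above can be produced with a small helper that slides a window of length n over each sentence’s tokens. The sketch below assumes a list `sentence_tokens`, one character vector of tokens per sentence (a hypothetical object); the same helper works for the three-grams analyzed next by setting n = 3.

```r
# Sketch: counting n-grams within sentences (`sentence_tokens` is an assumed input)
ngrams_of <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

bigrams     <- unlist(lapply(sentence_tokens, ngrams_of, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)   # "of the", "in the", ... dominate the top of the list
```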
Next, we analyze sequences of 3 consecutive words in the same sentence (otherwise known as “three-grams”.) We found a total of 5 million three-gram occurrences in the sample, comprising 3.5 million unique three-grams. Looking at the graph, it becomes clear that the 3-gram distribution is even more skewed than the 2-gram and single-word distributions:
At the top of the range we have three-grams such as “one of the” (with 2,143 occurrences) and “a lot of” (1,868 occurrences.) At the other end of the range we have over 3.1 million three-grams that occurred only once in the entire sample (for example: “zzziiiiiiiiippp there goes”.) Three-grams like the latter presumably won’t be very relevant for prediction.
## Frequency Unique Sequences Cumulative% Total Occurrences Cumulative%
## 1 3,112,219 88.4% 3,112,219 61.6%
## 2 218,377 94.6% 436,754 70.3%
## 3 70,823 96.7% 212,469 74.5%
## 4 34,031 97.6% 136,124 77.2%
## 5 19,769 98.2% 98,845 79.2%
## 6 12,862 98.5% 77,172 80.7%
## 7 8,919 98.8% 62,433 81.9%
## 8 6,564 99.0% 52,512 83.0%
## 9 4,891 99.1% 44,019 83.8%
## 10 3,997 99.2% 39,970 84.6%
## ...
## 959 1 99.99986% 959 99.858%
## 1,000 1 99.99989% 1,000 99.878%
## 1,035 1 99.99992% 1,035 99.899%
## 1,109 1 99.99994% 1,109 99.921%
## 1,868 1 99.99997% 1,868 99.958%
## 2,143 1 100.00000% 2,143 100.000%
## TOTAL 3,519,133 5,048,581
I also analyzed sequences of 4 and 5 consecutive words:
I found about 4.7 million four-gram occurrences, of which 4.3 million are unique. The most common four-gram (“the end of the”) appeared 490 times.
I found about 4.3 million five-gram occurrences, of which 4.2 million are unique. The most common five-gram (“at the end of the”) appeared 244 times.
After looking at the counts for 4- and 5-grams, it became clear that it didn’t make sense to go further (at least with the current size of our corpus sample.)
In conclusion, as we look for longer N-gram chains (i.e., as we increase N), the number of possible word combinations multiplies while the ability to fit those longer sequences into sentences of the same length is reduced. The skewness increases, as more n-grams tend to have frequency 1 (which is not good), and at the same time the higher-frequency n-grams tend to occur fewer times. This suggests that we would need to significantly increase the corpus size (or sample size) if we want to effectively use 4-grams, 5-grams, or beyond.
Regarding the design and the training of the predictive algorithm, I plan to:
Increase the number of English documents loaded (i.e., increase the sample size) in order to have higher counts (particularly for 3-grams, 4-grams, and 5-grams.) I am planning to test a solution with 4-gram and perhaps 5-gram sequences, but if the predictive accuracy doesn’t improve materially, then I may just keep a 1-/2-/3-gram solution (I’m assuming that shorter sequences will be more economical in memory and speed.)
Consider alternative sentence definitions. Try to determine whether there is a better approach to cutting sentences than at periods (after replacing numbers and abbreviations), question marks, exclamation marks, and ends of lines. If no other reliable programmatic method can be identified, the current definition is acceptable, but it’s worth trying to improve it.
With the larger corpus and the final sentence definitions, I will repeat the data preparation steps outlined before: replace numbers and common abbreviations, split sentences, remove special characters (except apostrophes, in order to keep contractions and possessive nouns intact), tokenize the sentences, and compute word and n-gram frequencies.
I will discard low-frequency words and n-grams (there are too many of them, and they add little value.) I will certainly drop those with frequency 1, as these account for almost 50% of the unique words and over 70% of the unique n-grams. If the application runs into performance or memory issues, I will test dropping words of frequency 2 or 3 as well. Given what we just saw in the exploratory analysis, that would save a lot of resources.
I’m planning to code my own predictive algorithm based on an N-gram model, since a relatively straightforward methodology is available (in the Coursera NLP course Week 2 videos) that I want to learn to implement. The methodology will require two things:
An efficient way of storing and quickly retrieving the frequencies/probabilities of a word given the previous (N-1) words. I am considering testing “hash tables” (i.e., R environments) for this purpose, because they allow very fast retrieval. The (N-1) previous words would be the “search key” in the table, and the “next word” candidates with their respective probabilities would be stored as the table’s value. An alternative to the hash tables could be a large data frame (but it may not be as computationally efficient.) See the first sketch after this list.
A process to calculate the candidate word probabilities. I’m considering implementing the Good-Turing methodology for this (see the second sketch below.)
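Regarding the first point, here is a minimal sketch of an R environment used as a hash table. The key is the previous (N-1) words joined into a single string; the value holds the candidate next words with their probabilities. All names and the tiny example entry are illustrative.

```r
# Minimal sketch: R environment as a hash table for next-word lookup (illustrative data)
lookup <- new.env(hash = TRUE, parent = emptyenv())

# Store candidate next words for the context "one of"
assign("one of",
       data.frame(word = c("the", "my", "those"),
                  prob = c(0.61, 0.12, 0.05),
                  stringsAsFactors = FALSE),
       envir = lookup)

predict_next <- function(context) {
  if (exists(context, envir = lookup, inherits = FALSE)) {
    cands <- get(context, envir = lookup, inherits = FALSE)
    cands$word[which.max(cands$prob)]
  } else {
    NA_character_    # in the full model we would back off to a shorter context
  }
}

predict_next("one of")   # "the"
```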
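Regarding the second point, the basic Good-Turing idea replaces a raw count c with an adjusted count c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-grams observed exactly c times. Below is a simplified sketch; a real implementation would also smooth the N(c) values themselves, which this version skips.

```r
# Sketch: simple Good-Turing adjusted counts from a vector of raw n-gram counts.
# Falls back to the raw count whenever N(c+1) is missing (no smoothing of N(c)).
good_turing <- function(counts) {
  Nc <- table(counts)                         # N(c): how many n-grams were seen c times
  adjust <- function(cnt) {
    n_c  <- Nc[as.character(cnt)]
    n_c1 <- Nc[as.character(cnt + 1)]
    if (is.na(n_c) || is.na(n_c1)) return(cnt)
    (cnt + 1) * as.numeric(n_c1) / as.numeric(n_c)
  }
  sapply(counts, adjust)
}

good_turing(c(1, 1, 1, 2, 2, 3))   # counts shrink, shifting mass toward unseen n-grams
```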
I won’t add numbers as candidates for prediction (just words), but I will keep a special token to remember the position of a number in the n-gram. This will allow us to take advantage of high-frequency patterns that contain numbers (e.g. “# year old”, “# percent of”, “at # pm”, “between # and”.)
I won’t add profanity words as candidates for prediction, but I will keep them in the corpus. If we were to delete profanity words, we could introduce gaps in the data that could lead to incorrect n-grams and incorrect predictions. I will use a standard profanity list (from Google) to identify profanity terms.
Stop words (i.e., very common words) also need to remain in the corpus because they could be valid predictions.
I’m not too concerned about spelling errors in the corpus. I’m considering using a dictionary and adding terms that appear with medium or high frequency in the corpus (even if they are not in the “official” dictionary.) If in practice people use a certain word frequently, then we may want to predict it. We would only drop terms that are not in the dictionary and have low frequency in our corpus.
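As a rough illustration of that rule, the hunspell package can flag terms that are not in a standard dictionary, and those flags can be combined with the observed counts. The cutoff and object names here are illustrative (`word_freq` is the count table from an earlier sketch).

```r
# Sketch: keep terms that are either in the dictionary or frequent enough in our corpus
library(hunspell)

terms      <- names(word_freq)
in_dict    <- hunspell_check(terms)          # TRUE if the term is in the en_US dictionary
min_keep   <- 5                              # arbitrary cutoff for "frequent enough"
kept_terms <- terms[in_dict | as.integer(word_freq) >= min_keep]
```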
Finally, I have mentioned the intention of testing and comparing various algorithms (with N-grams of size 3, 4, or 5) and perhaps dropping more terms (those of frequency 2 or 3 in addition to 1). So how do I plan to determine which option works better? Here is my plan (a rough evaluation sketch follows the list):
I would randomly select a group of documents as a “validation set” (which would be completely independent of the training documents.)
I would have the algorithm run automatically to predict all the “next words” in those validation documents. When the prediction matches the actual word, I will record a success; otherwise, I will record a failure. I will count all successes and failures at the end.
I will time the algorithms and measure their memory consumption.
Based on the prediction accuracy, speed, and memory use, I will select the final algorithm for my product.
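A rough sketch of that evaluation loop is shown below. It assumes `val_tokens` (the tokens of one validation sentence) and the hypothetical `predict_next` function and `lookup` table from the storage sketch above; timing and memory can be checked with base R tools.

```r
# Sketch: next-word prediction accuracy over one validation sentence (inputs assumed)
evaluate_sentence <- function(val_tokens, n = 3) {
  if (length(val_tokens) < n) return(c(accuracy = NA, predictions = 0))
  hits <- 0; tries <- 0
  for (i in seq(n, length(val_tokens))) {
    context    <- paste(val_tokens[(i - n + 1):(i - 1)], collapse = " ")
    prediction <- predict_next(context)
    tries <- tries + 1
    if (!is.na(prediction) && prediction == val_tokens[i]) hits <- hits + 1
  }
  c(accuracy = hits / tries, predictions = tries)
}

system.time(evaluate_sentence(val_tokens))   # speed of the algorithm
object.size(lookup)                          # approximate memory used by the lookup table
```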
Report date: Sunday, November 16, 2014