The Capstone Project of the Coursera Data Science Specialization requires us to develop a text-prediction application similar to the product that SwiftKey offers. The objective of this first report is to summarize the exploratory analysis of the English corpus that we’ll use as a basis for prediction. We’ll also describe our plans for creating the prediction algorithm and the application.
The corpus has three sources that together contain over 100 million words (a sketch of how such counts can be computed follows the list):
A set of 899,288 blog entries containing an average of 42 words per entry
A set of 1,010,242 news lines containing an average of 34 words per line
A set of 2.3 million Twitter messages containing an average of 13 words per message
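For reference, counts like the ones above can be computed with a few lines of R. This is only a sketch: it assumes the standard file names of the SwiftKey dataset (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and uses the stringi package for word counting.

```r
# Sketch: line and word counts per source file (file names assumed, not verified here)
library(stringi)

count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- stri_count_words(lines)
  c(lines = length(lines), words = sum(words), words_per_line = round(mean(words), 1))
}

sapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), count_file)
```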
In order to have more sensible predictions, I decided to use the “sentence” as the basic unit of analysis (instead of using “lines”), since the elements of a sentence can be expected to have a certain coherence, whereas words in the same line are not necessarily linked. I found that, on average:
the blogs have 2.7 sentences per line,
the news have 2.1 sentences per line, and
Twitter has 1.6 sentences per line.
I considered a new sentence whenever I encountered a period, question mark, exclamation mark, or end of line. Prior to splitting sentences, I replaced all stand-alone numbers (including those with commas, signs, and decimal points) with a token, in order to keep track of the number’s position while avoiding confusing a decimal point with an end of sentence. I also replaced some frequent abbreviations (such as “p.m.” with “pm”.) The number of sentences per line for blogs and news seems low, so refining how sentences are defined is something that will warrant further examination.
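As an illustration, the splitting rules just described could be implemented along the following lines. This is a sketch only: the “&lt;num&gt;” token name, the abbreviation list, and the regular expressions are illustrative choices, not the exact ones used.

```r
# Sketch of the sentence-splitting rules described above (token name and regexes illustrative)
split_sentences <- function(lines) {
  # Replace a few frequent abbreviations so their periods do not end a sentence
  lines <- gsub("\\bp\\.m\\.", "pm", lines, ignore.case = TRUE)
  lines <- gsub("\\ba\\.m\\.", "am", lines, ignore.case = TRUE)
  # Replace stand-alone numbers (optional sign, commas, decimal point) with a token
  lines <- gsub("[+-]?\\d[\\d,]*(\\.\\d+)?", "<num>", lines, perl = TRUE)
  # Split at periods, question marks, and exclamation marks; each line end also ends a sentence
  sentences <- unlist(strsplit(lines, "[.?!]+"))
  trimws(sentences[nzchar(trimws(sentences))])
}

split_sentences("She ran 26.2 miles in 4 hours. Amazing, right? Yes!")
# "She ran <num> miles in <num> hours"  "Amazing, right"  "Yes"
```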
In order to analyze the word frequencies in the sentences, I took a random sample of the documents and applied the following criteria (a code sketch follows the list):
Converted the contents to lowercase in order to avoid counting words twice due to capitalization.
Removed all special characters except apostrophes (so that contractions and possessive nouns remain intact.)
Decided not to convert the sentences to “stem words” and not to remove “stop words”, because that would hurt the quality of our predictions. All words need to be counted exactly as they were written, so that we can predict the right word in the right place. Perhaps misspellings are a valid exception to this rule; we will discuss that later in this report.
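The code below illustrates these cleaning criteria (lowercasing, keeping apostrophes, dropping other special characters, no stemming or stop-word removal). The function names are illustrative, and a real pipeline would also need to preserve the number token introduced earlier.

```r
# Sketch of the cleaning and tokenization steps listed above (names are illustrative)
clean_sentence <- function(sentence) {
  s <- tolower(sentence)            # avoid counting the same word twice due to capitalization
  s <- gsub("[^a-z' ]", " ", s)     # keep letters, apostrophes, and spaces only
  gsub("\\s+", " ", trimws(s))      # collapse repeated whitespace
}

tokenize <- function(sentence) {
  unlist(strsplit(clean_sentence(sentence), " "))
}

tokenize("Don't worry; it's John's turn!")
# "don't" "worry" "it's" "john's" "turn"
```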
After these steps, I ended up with 4.6 million words containing about 125 thousand unique terms. Here is the distribution of word frequencies:
The graph shows a highly skewed distribution of words; put another way, there is a very large number of terms that appear with very low frequency. In fact, almost 50% of the terms appear only once in the sample! I found that many of those infrequent terms are the product of spelling mistakes, while others are valid but uncommon terms (such as “agronomist”.) In either case, we’ll ignore these terms in the final solution, given how little value they add relative to their cost in memory and prediction speed.
The graph shows bars only up to 40 occurrences per term because, as one can see, the number of terms drops significantly as we approach that point. At the other extreme of the range (not visible in the graph) we find a few words that appear with extremely high frequency, such as “the” (which occurred over 289 thousand times in the same sample) or “and” (over 149 thousand times.)
The following table shows the distribution of word frequencies (including cumulative percentages) at both extremes of the range (i.e., for the least frequent and most frequent terms):
## Frequency Unique Words Cumulative% Total Occurrences Cumulative%
## 1 60,585 48.8% 60,585 1.31%
## 2 16,016 61.7% 32,032 2.00%
## 3 8,193 68.3% 24,579 2.53%
## 4 5,040 72.4% 20,160 2.96%
## 5 3,684 75.3% 18,420 3.36%
## 6 2,745 77.5% 16,470 3.71%
## 7 2,176 79.3% 15,232 4.04%
## 8 1,803 80.7% 14,424 4.35%
## 9 1,458 81.9% 13,122 4.63%
## 10 1,271 82.9% 12,710 4.91%
## ...
## 44,293 1 99.99597% 44,293 86.0%
## 60,325 1 99.99678% 60,325 87.3%
## 63,371 1 99.99758% 63,371 88.6%
## 88,496 1 99.99839% 88,496 90.5%
## 149,770 1 99.99919% 149,770 93.8%
## 289,959 1 100.00000% 289,959 100.0%
## TOTAL 124,152 4,641,107
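For completeness, here is a sketch of how a frequency-of-frequencies table like the one above can be assembled. It assumes a character vector `words` holding every token in the sample (a hypothetical object name).

```r
# Sketch: frequency-of-frequencies table from a vector of tokens (`words` is assumed)
word_freq <- table(words)                       # occurrences of each unique term
freq_freq <- table(as.integer(word_freq))       # how many terms share each occurrence count

summary_tbl <- data.frame(
  Frequency   = as.integer(names(freq_freq)),
  UniqueWords = as.integer(freq_freq)
)
summary_tbl$TotalOccurrences <- summary_tbl$Frequency * summary_tbl$UniqueWords
summary_tbl$CumUniquePct     <- 100 * cumsum(summary_tbl$UniqueWords) / sum(summary_tbl$UniqueWords)
summary_tbl$CumOccurrPct     <- 100 * cumsum(summary_tbl$TotalOccurrences) / sum(summary_tbl$TotalOccurrences)
head(summary_tbl, 10)
```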
Next, we analyzed pairs of consecutive words (i.e., bigrams) within the same sentence. We found a total of 5.4 million word-pair occurrences in the sample, comprising 1.6 million unique pairs. Here is the distribution of frequencies for those pairs:
The graph shows that bigrams are also highly skewed. At the lower end of the range we have many pairs that appear only once (for example, rarities such as “zyrtec makes”), while at the other end we have a few pairs that appear a disproportionately large number of times, such as “of the” (27+ thousand times) or “in the” (24+ thousand times.)
As before, we also present this information in tabular form:
## Frequency Unique pairs Cumulative% Total Occurrences Cumulative%
## 1 1,199,823 74.6% 1,199,823 22.0%
## 2 172,689 85.3% 345,378 28.4%
## 3 68,787 89.6% 206,361 32.2%
## 4 37,234 91.9% 148,936 34.9%
## 5 23,832 93.4% 119,160 37.1%
## 6 16,547 94.4% 99,282 38.9%
## 7 12,155 95.2% 85,085 40.5%
## 8 9,127 95.7% 73,016 41.8%
## 9 7,389 96.2% 66,501 43.0%
## 10 5,842 96.6% 58,420 44.1%
## ...
## 9,269 1 99.99969% 9,269 98.39%
## 10,793 1 99.99975% 10,793 98.59%
## 11,663 1 99.99981% 11,663 98.81%
## 12,948 1 99.99988% 12,948 99.04%
## 24,882 1 99.99994% 24,882 99.50%
## 27,152 1 100.00000% 27,152 100.00%
## TOTAL 1,608,601 5,445,373
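Bigram counts like the ones above can be produced with a small helper that slides a window of length n over each sentence’s tokens. The sketch below assumes a list `sentence_tokens`, one character vector of tokens per sentence (a hypothetical object); the same helper works for the three-grams analyzed next by setting n = 3.

```r
# Sketch: counting n-grams within sentences (`sentence_tokens` is an assumed input)
ngrams_of <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

bigrams     <- unlist(lapply(sentence_tokens, ngrams_of, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)   # "of the", "in the", ... dominate the top of the list
```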
Next, we analyze sequences of 3 consecutive words in the same sentence (otherwise known as “three-grams”.) We found a total of 5 million three-gram occurrences in the sample, comprising 3.5 million unique three-grams. Looking at the graph, it becomes clear that the 3-gram distribution is even more skewed than the 2-gram and single-word distributions:
At the top of the range we have three-grams such as “one of the” (with 2,143 occurrences) and “a lot of” (1,868 occurrences.) At the other end of the range we have over 3.1 million three-grams that occurred only once in the entire sample (for example: “zzziiiiiiiiippp there goes”.) Three-grams like the latter presumably won’t be very relevant for prediction.
## Frequency Unique Sequences Cumulative% Total Occurrences Cumulative%
## 1 3,112,219 88.4% 3,112,219 61.6%
## 2 218,377 94.6% 436,754 70.3%
## 3 70,823 96.7% 212,469 74.5%
## 4 34,031 97.6% 136,124 77.2%
## 5 19,769 98.2% 98,845 79.2%
## 6 12,862 98.5% 77,172 80.7%
## 7 8,919 98.8% 62,433 81.9%
## 8 6,564 99.0% 52,512 83.0%
## 9 4,891 99.1% 44,019 83.8%
## 10 3,997 99.2% 39,970 84.6%
## ...
## 959 1 99.99986% 959 99.858%
## 1,000 1 99.99989% 1,000 99.878%
## 1,035 1 99.99992% 1,035 99.899%
## 1,109 1 99.99994% 1,109 99.921%
## 1,868 1 99.99997% 1,868 99.958%
## 2,143 1 100.00000% 2,143 100.000%
## TOTAL 3,519,133 5,048,581
I also analyzed sequences of 4 and 5 consecutive words:
I found about 4.7 million four-gram occurrences, of which 4.3 million are unique. The most common four-gram (“the end of the”) appeared 490 times.
I found about 4.3 million five-gram occurrences, of which 4.2 million are unique. The most common five-gram (“at the end of the”) appeared 244 times.
After looking at the counts for 4- and 5-grams, it became clear that it didn’t make sense to go further (at least with the current size of our corpus sample.)
In conclusion, as we look for longer N-gram chains (i.e., as we increase N), the number of possible word combinations multiplies while the ability to fit those longer sequences into sentences of the same length is reduced. The skewness increases, as more n-grams tend to have frequency 1 (which is not good), and at the same time the higher-frequency n-grams tend to occur fewer times. This suggests that we would need to significantly increase the corpus size (or sample size) if we want to effectively use 4-grams, 5-grams, or beyond.
Regarding the design and the training of the predictive algorithm, I plan to:
Increase the number of English documents loaded (i.e., increase the sample size) in order to have higher counts (particularly for 3-grams, 4-grams, and 5-grams.) I am planning to test a solution with 4-gram and perhaps 5-gram sequences, but if the predictive accuracy doesn’t improve materially, then I may just keep a 1-/2-/3-gram solution (I’m assuming that shorter sequences will be more economical in memory and speed.)
Consider alternative sentence definitions. Try to determine whether there is a better approach to cutting sentences than at periods (after replacing numbers and abbreviations), question marks, exclamation marks, and ends of lines. If no other reliable programmatic method can be identified, the current definition is acceptable, but it’s worth trying to improve it.
With the larger corpus and the final sentence definitions, I will repeat the data preparation steps outlined before: replace numbers and common abbreviations, split sentences, remove special characters (except apostrophes, in order to keep contractions and possessive nouns intact), tokenize the sentences, and compute word and n-gram frequencies.
I will discard low-frequency words and n-grams (there are too many of them, and they add little value.) I will certainly drop those with frequency 1, as these account for almost 50% of the unique words and over 70% of the unique n-grams. If the application runs into performance or memory issues, I will test dropping words of frequency 2 or 3 as well. Given what we just saw in the exploratory analysis, that would save a lot of resources.
I’m planning to code my own predictive algorithm based on an N-gram model, since a relatively straightforward methodology is available (in the Coursera NLP course Week 2 videos) that I want to learn to implement. The methodology will require two things:
An efficient way of storing and quickly retrieving the frequencies/probabilities of a word given the previous (N-1) words. I am considering testing “hash tables” (i.e., R environments) for this purpose, because they allow very fast retrieval. The (N-1) previous words would be the “search key” in the table, and the “next word” candidates with their respective probabilities would be stored as the table’s value. An alternative to the hash tables could be a large data frame (but it may not be as computationally efficient.) See the first sketch after this list.
A process to calculate the candidate word probabilities. I’m considering implementing the Good-Turing methodology for this (see the second sketch below.)
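Regarding the first point, here is a minimal sketch of an R environment used as a hash table. The key is the previous (N-1) words joined into a single string; the value holds the candidate next words with their probabilities. All names and the tiny example entry are illustrative.

```r
# Minimal sketch: R environment as a hash table for next-word lookup (illustrative data)
lookup <- new.env(hash = TRUE, parent = emptyenv())

# Store candidate next words for the context "one of"
assign("one of",
       data.frame(word = c("the", "my", "those"),
                  prob = c(0.61, 0.12, 0.05),
                  stringsAsFactors = FALSE),
       envir = lookup)

predict_next <- function(context) {
  if (exists(context, envir = lookup, inherits = FALSE)) {
    cands <- get(context, envir = lookup, inherits = FALSE)
    cands$word[which.max(cands$prob)]
  } else {
    NA_character_    # in the full model we would back off to a shorter context
  }
}

predict_next("one of")   # "the"
```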
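Regarding the second point, the basic Good-Turing idea replaces a raw count c with an adjusted count c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-grams observed exactly c times. Below is a simplified sketch; a real implementation would also smooth the N(c) values themselves, which this version skips.

```r
# Sketch: simple Good-Turing adjusted counts from a vector of raw n-gram counts.
# Falls back to the raw count whenever N(c+1) is missing (no smoothing of N(c)).
good_turing <- function(counts) {
  Nc <- table(counts)                         # N(c): how many n-grams were seen c times
  adjust <- function(cnt) {
    n_c  <- Nc[as.character(cnt)]
    n_c1 <- Nc[as.character(cnt + 1)]
    if (is.na(n_c) || is.na(n_c1)) return(cnt)
    (cnt + 1) * as.numeric(n_c1) / as.numeric(n_c)
  }
  sapply(counts, adjust)
}

good_turing(c(1, 1, 1, 2, 2, 3))   # counts shrink, shifting mass toward unseen n-grams
```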
I won’t add numbers as candidates for prediction (just words), but I will keep a special token to remember the position of a number in the n-gram. This will allow us to take advantage of high-frequency patterns that contain numbers (e.g. “# year old”, “# percent of”, “at # pm”, “between # and”.)
I won’t add profanity words as candidates for prediction, but I will keep them in the corpus. If we were to delete profanity words, we could introduce gaps in the data that could lead to incorrect n-grams and incorrect predictions. I will use a standard profanity list (from Google) to identify profanity terms.
Stop words (i.e., very common words) also need to remain in the corpus because they could be valid predictions.
I’m not too concerned about spelling errors in the corpus. I’m considering using a dictionary and adding terms that appear with medium or high frequency in the corpus (even if they are not in the “official” dictionary.) If in practice people use a certain word frequently, then we may want to predict it. We would only drop terms that are not in the dictionary and have low frequency in our corpus.
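As a rough illustration of that rule, the hunspell package can flag terms that are not in a standard dictionary, and those flags can be combined with the observed counts. The cutoff and object names here are illustrative (`word_freq` is the count table from an earlier sketch).

```r
# Sketch: keep terms that are either in the dictionary or frequent enough in our corpus
library(hunspell)

terms      <- names(word_freq)
in_dict    <- hunspell_check(terms)          # TRUE if the term is in the en_US dictionary
min_keep   <- 5                              # arbitrary cutoff for "frequent enough"
kept_terms <- terms[in_dict | as.integer(word_freq) >= min_keep]
```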
Finally, I have mentioned the intention of testing and comparing various algorithms (with N-grams of size 3, 4, or 5) and perhaps dropping more terms (those of frequency 2 or 3 in addition to 1). So how do I plan to determine which option works better? Here is my plan (a rough evaluation sketch follows the list):
I would randomly select a group of documents as a “validation set” (which would be completely independent of the training documents.)
I would have the algorithm run automatically to predict all the “next words” in those validation documents. When the prediction matches the actual word, I will record a success; otherwise, I will record a failure. I will count all successes and failures at the end.
I will time the algorithms and measure their memory consumption.
Based on the prediction accuracy, speed, and memory use, I will select the final algorithm for my product.
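A rough sketch of that evaluation loop is shown below. It assumes `val_tokens` (the tokens of one validation sentence) and the hypothetical `predict_next` function and `lookup` table from the storage sketch above; timing and memory can be checked with base R tools.

```r
# Sketch: next-word prediction accuracy over one validation sentence (inputs assumed)
evaluate_sentence <- function(val_tokens, n = 3) {
  if (length(val_tokens) < n) return(c(accuracy = NA, predictions = 0))
  hits <- 0; tries <- 0
  for (i in seq(n, length(val_tokens))) {
    context    <- paste(val_tokens[(i - n + 1):(i - 1)], collapse = " ")
    prediction <- predict_next(context)
    tries <- tries + 1
    if (!is.na(prediction) && prediction == val_tokens[i]) hits <- hits + 1
  }
  c(accuracy = hits / tries, predictions = tries)
}

system.time(evaluate_sentence(val_tokens))   # speed of the algorithm
object.size(lookup)                          # approximate memory used by the lookup table
```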
Report date: Sunday, November 16, 2014