Executive Summary

This is a milestone report for the Coursera Data Science Specialization capstone project to create a model for predicting text input based upon a corpus of text sampled from twitter, blog, and news feed data sources. The project goal is to create a fast, small-footprint (in terms of system memory) application that will predict the next word a user will enter based upon the sequence of words previously entered. This report encompasses a basic exploratory analysis of the corpus and a look forward to development of the prediction model.

Exploratory Data Analysis

Summary Statistics

The data for this analysis was provided by the Johns Hopkins University team responsible for defining the project, in the form of a ZIP-compressed file containing text excerpts from twitter streams, blog streams, and news item streams. The excerpts were in English, German, Finnish, and Russian; only the English text was used for this analysis. No other text sources were considered.

Table 1 shows a few basic statistics about the raw text data.

Table 1: Raw Text Statistics
Source       File Size (MB)   Line Count   Word Count
news feeds            205.8       77,259    2,693,898
blogs                 210.2      899,288   38,154,238
twitter               167.1    2,360,148   30,218,166

Note that the word count is the total number of words and not the number of unique words.
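
The figures in Table 1 can be computed along the following lines in R, one source file at a time; the file name below is an assumption, and the whitespace-based word count is only one reasonable definition.

    # Basic statistics for one source file
    f <- "en_US.twitter.txt"                       # assumed file name
    file.size(f) / 2^20                            # file size in MB
    lines <- readLines(f, skipNul = TRUE)
    length(lines)                                  # line count
    sum(sapply(strsplit(lines, "\\s+"), length))   # total word count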

This amount of text is too much to deal with at an exploratory level. Therefore, a random sample of 1% of the text from each source was selected for further analysis. Table 2 shows the statistics of the sampled text.
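
The sampling could be done along the lines of the sketch below; the binomial draw and the seed are assumptions about the approach rather than the exact code used.

    # Keep each line of a source with probability 0.01 (a 1% random sample);
    # 'lines' is assumed to hold the full text of one source from readLines()
    set.seed(1234)   # assumed seed, for reproducibility
    keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1
    sampleLines <- lines[keep]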

Table 2: Sample Text Statistics
Source       Line Count   Word Count
news feeds          772       27,238
blogs             8,992      378,438
twitter          23,601      302,844

Pre-processing

A visual inspection of the data showed that it contained many elements that are not useful for creating the final predictive model. These include numerical elements (dates, times, numbers, etc.), Internet web addresses, email addresses, social media handles, and hashtags. The text also contained “bad” words (e.g., swear words/phrases) that would not be appropriate to suggest as predictions.

The following steps were applied to pre-process the text for further analysis; a sketch of this pipeline in R follows the list.

  • Eliminate Internet/web elements (URLs, email addresses, handles, and hashtags)
  • Remove punctuation
  • Remove numerical elements
  • Remove unwanted ("bad") words
  • Down-shift everything to lower case
  • Strip out unnecessary white space
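
The sketch below illustrates one way these steps could be implemented with the R tm package; the object names (sampleText, badWords) and the regular expressions for the Internet/web elements are assumptions, not the exact code used.

    library(tm)

    # Build a corpus from the sampled text ('sampleText' is assumed to be a
    # character vector of sampled lines; 'badWords' a vector of words to drop)
    corpus <- VCorpus(VectorSource(sampleText))

    # Eliminate Internet/web elements: URLs, email addresses, handles, hashtags
    dropPattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    corpus <- tm_map(corpus, dropPattern, "(https?|ftp)://\\S+|www\\.\\S+")
    corpus <- tm_map(corpus, dropPattern, "\\S+@\\S+")
    corpus <- tm_map(corpus, dropPattern, "[@#]\\S+")

    corpus <- tm_map(corpus, removePunctuation)              # remove punctuation
    corpus <- tm_map(corpus, removeNumbers)                  # remove numerical elements
    corpus <- tm_map(corpus, content_transformer(tolower))   # down-shift to lower case
    corpus <- tm_map(corpus, removeWords, badWords)          # remove unwanted ("bad") words
    corpus <- tm_map(corpus, stripWhitespace)                # strip extra white space

Down-shifting before the bad-word removal keeps that match case-insensitive.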

Exploratory Analysis

Of greatest interest in exploring the text data is the prevalence of common single words as well as common two-, three-, and four-word sequences (n-grams). This was determined for the sample text using the R tm package: a tm corpus was created containing all of the sample text, and tm was then used to compute the frequencies of words and n-grams. Chart 1 through Chart 4 show the top 20 most prevalent items of each n-gram size.
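
A sketch of the frequency calculation for 2-grams appears below; the use of RWeka's NGramTokenizer alongside tm is an assumption about tooling (the report only commits to tm), and the object names are illustrative.

    library(tm)
    library(RWeka)

    # Tokenize the pre-processed corpus into 2-grams (change min/max for 3- and 4-grams)
    bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))

    # Overall 2-gram frequencies, sorted from most to least frequent
    freq2 <- sort(slam::row_sums(tdm2), decreasing = TRUE)
    head(freq2, 20)   # the top 20 shown in Chart 2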

[Chart 1 through Chart 4: top 20 most frequent single words, 2-grams, 3-grams, and 4-grams]

Note that words that are traditionally considered “stop” words (“and”, “but”, “or”, “the”, etc.) were not removed from the corpus. If the project goal were to categorize the contents of the text data, these words would have been removed as unimportant to the semantic content of the text. However, these words are critical for the prediction model because a user may enter one of them as the next word in sequence. Therefore, they need to show up as possible predictions when appropriate. This is clearly evident in the n-gram charts, where these words appear very often in the most frequently used sequences.

Interestingly, the histograms show that the frequency counts of the n-grams drop by almost a full order of magnitude between each size: from a maximum of around 30,000 for the most frequently used single word, to around 3,000 for 2-grams, around 300 for 3-grams, and around 60 for 4-grams.

Table 3: Memory Usage (bytes)
Frequency List   Memory Used
Single Words       2,820,344
2-grams           23,053,776
3-grams           44,100,144
4-grams           54,689,704

Table 3 indicates that a large amount of memory is consumed by the n-gram frequency tables - and this for just a 1% sample of the full text data set.
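
For reference, measurements of this kind can be obtained with object.size(); the frequency-table names below (freq1 through freq4) are assumptions carried over from the earlier sketch.

    # Approximate memory used by each frequency table
    format(object.size(freq1), units = "b")   # single words
    format(object.size(freq2), units = "b")   # 2-grams
    format(object.size(freq3), units = "b")   # 3-grams
    format(object.size(freq4), units = "b")   # 4-grams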

Table 4: Top N-gram Usage (percent)
N-gram Size    Total Count   25% Threshold   50% Threshold   75% Threshold
Single Words        43,768             0.1             0.6             4.4
2-grams            332,736             0.7             8.9            49.0
3-grams            584,289            12.9            41.9            71.0
4-grams            661,831            23.1            48.7            74.4

The histograms in Chart 1 through Chart 4 indicate a severe skewing of the frequency distribution toward low-frequency n-grams. Table 4 shows this in another way, as the percentage of distinct n-grams (ordered from most to least frequently occurring, for each n-gram length) needed to account for a given share of overall n-gram usage. For example, the most frequent 0.6% of distinct words account for 50% of all word occurrences in the corpus. Similarly, the most frequent 8.9% of 2-grams account for 50% of all 2-gram occurrences. However, it takes 41.9% and 48.7% of the 3-grams and 4-grams, respectively, to reach the same 50% usage threshold.
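
These thresholds can be computed with a short helper like the one below; this is a sketch assuming the sorted frequency vectors from the earlier code (e.g., freq2), not the exact code used for the table.

    # Percentage of distinct n-grams (most to least frequent) needed to cover
    # a given share of total n-gram usage
    coverage <- function(freq, threshold = 0.50) {
      cumShare <- cumsum(as.numeric(freq)) / sum(as.numeric(freq))
      100 * which(cumShare >= threshold)[1] / length(freq)
    }
    coverage(freq2, 0.50)   # roughly 8.9 for 2-grams, per Table 4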

Table 5 shows what is implied by Table 4: that a very high percentage of the n-grams occur only once in the sample data.

Table 5: Single Occurrence N-grams (percent)
N-gram Size    N-grams Occurring Only Once
Single Words                          55.7
2-grams                               81.1
3-grams                               93.4
4-grams                               98.4

Conclusion

Given the degree of change in frequency counts between n-gram sizes, it appears that the shorter n-grams are likely to play the most significant role in the prediction algorithm. While the longer three- and four-word sequences might yield high prediction accuracy when they do match, they occur so infrequently in user-entered text that they could lead to a high number of prediction failures if they were the sole basis for the model.
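
One way to combine the n-gram sizes is a simple back-off lookup, sketched below. This is only an illustration of the idea, with an assumed ngramTables structure (a list, indexed by n-gram size, of data frames with prefix, nextWord, and count columns), not a committed design.

    # Try the longest matching n-gram first, then fall back to shorter ones
    predictNext <- function(inputWords, ngramTables, maxN = 4) {
      for (n in maxN:2) {
        if (length(inputWords) >= n - 1) {
          prefix <- paste(tail(inputWords, n - 1), collapse = " ")
          hits <- ngramTables[[n]][ngramTables[[n]]$prefix == prefix, ]
          if (nrow(hits) > 0) return(hits$nextWord[which.max(hits$count)])
        }
      }
      # No match at any length: fall back to the most frequent single word
      ngramTables[[1]]$nextWord[1]
    }

    # Example: predictNext(c("thanks", "for", "the"), ngramTables)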

Memory usage will be an important issue in the design of the prediction model. Even if memory usage scales only linearly with sample size (a highly unlikely situation), the amount of memory needed to store frequency tables covering the entire text data set would likely make them impractical to build, let alone use in a prediction model, without substantial computing and memory capacity. Fortunately, some paring of the single-word and 2-gram frequency tables appears possible to scale back the memory requirement of the prediction model. Paring of the longer n-grams will require some experimentation to find the “sweet spot” in the tradeoff between model accuracy and memory footprint.
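
As a first paring step, the single-occurrence n-grams highlighted in Table 5 could simply be dropped; the sketch below assumes the 4-gram frequency vector freq4 from the earlier code.

    # Drop 4-grams that occur only once
    freq4pruned <- freq4[freq4 > 1]
    length(freq4pruned) / length(freq4)   # fraction retained (~1.6%, per Table 5)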