Executive Summary

This is a milestone report for the Coursera Data Science Specialization capstone project to create a model for predicting text input based upon a corpus of text sampled from twitter, blog, and news feed data sources. The project goal is to create a fast, small-footprint (in terms of system memory) application that will predict the next word a user will enter based upon the sequence of words previously entered. This report encompasses a basic exploratory analysis of the corpus and a look forward to development of the prediction model.

Exploratory Data Analysis

Summary Statistics

The data for this analysis was provided by the Johns Hopkins University team responsible for defining the project, in the form of a ZIP-compressed file containing text excerpts from twitter streams, blog streams, and news item streams. The excerpts were in English, German, Finnish, and Russian; only the English text was used for this analysis. No other text sources were considered.

Table 1 shows a few basic statistics about the raw text data.

Table 1: Raw Text Statistics
Source       File Size (MB)   Line Count   Word Count
news feeds            205.8       77,259    2,693,898
blogs                 210.2      899,288   38,154,238
twitter               167.1    2,360,148   30,218,166

Note that the word count is the total number of words and not the number of unique words.
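
The figures in Table 1 can be computed along the following lines in R, one source file at a time; the file name below is an assumption, and the whitespace-based word count is only one reasonable definition.

    # Basic statistics for one source file
    f <- "en_US.twitter.txt"                       # assumed file name
    file.size(f) / 2^20                            # file size in MB
    lines <- readLines(f, skipNul = TRUE)
    length(lines)                                  # line count
    sum(sapply(strsplit(lines, "\\s+"), length))   # total word count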

This amount of text is too much to deal with at an exploratory level. Therefore, a random sample of 1% of the text from each source was selected for further analysis. Table 2 shows the statistics of the sampled text.
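
The sampling could be done along the lines of the sketch below; the binomial draw and the seed are assumptions about the approach rather than the exact code used.

    # Keep each line of a source with probability 0.01 (a 1% random sample);
    # 'lines' is assumed to hold the full text of one source from readLines()
    set.seed(1234)   # assumed seed, for reproducibility
    keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1
    sampleLines <- lines[keep]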

Table 2: Sample Text Statistics
Source       Line Count   Word Count
news feeds          772       27,238
blogs             8,992      378,438
twitter          23,601      302,844

Pre-processing

A visual inspection of the data showed that it contained many elements that are not useful for creating the final predictive model. These include numerical elements (dates, times, numbers, etc.), Internet web addresses, email addresses, social media handles, and hashtags. The text also contained “bad” words (e.g., swear words/phrases) that would not be appropriate to suggest as predictions.

The following steps were applied to pre-process the text for further analysis; a sketch of this pipeline in R follows the list.

  • Eliminate Internet/web elements (URLs, email addresses, handles, and hashtags)
  • Remove punctuation
  • Remove numerical elements
  • Remove unwanted ("bad") words
  • Down-shift everything to lower case
  • Strip out unnecessary white space
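
The sketch below illustrates one way these steps could be implemented with the R tm package; the object names (sampleText, badWords) and the regular expressions for the Internet/web elements are assumptions, not the exact code used.

    library(tm)

    # Build a corpus from the sampled text ('sampleText' is assumed to be a
    # character vector of sampled lines; 'badWords' a vector of words to drop)
    corpus <- VCorpus(VectorSource(sampleText))

    # Eliminate Internet/web elements: URLs, email addresses, handles, hashtags
    dropPattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    corpus <- tm_map(corpus, dropPattern, "(https?|ftp)://\\S+|www\\.\\S+")
    corpus <- tm_map(corpus, dropPattern, "\\S+@\\S+")
    corpus <- tm_map(corpus, dropPattern, "[@#]\\S+")

    corpus <- tm_map(corpus, removePunctuation)              # remove punctuation
    corpus <- tm_map(corpus, removeNumbers)                  # remove numerical elements
    corpus <- tm_map(corpus, content_transformer(tolower))   # down-shift to lower case
    corpus <- tm_map(corpus, removeWords, badWords)          # remove unwanted ("bad") words
    corpus <- tm_map(corpus, stripWhitespace)                # strip extra white space

Down-shifting before the bad-word removal keeps that match case-insensitive.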

Exploratory Analysis

Of greatest interest in exploring the text data is the prevalence of common single words as well as common two-, three-, and four-word sequences (n-grams). This was determined for the sample text using the R tm package: a tm corpus was created containing all of the sample text, and tm was then used to compute the frequencies of words and n-grams. Chart 1 through Chart 4 show the top 20 most prevalent items of each n-gram size.
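
A sketch of the frequency calculation for 2-grams appears below; the use of RWeka's NGramTokenizer alongside tm is an assumption about tooling (the report only commits to tm), and the object names are illustrative.

    library(tm)
    library(RWeka)

    # Tokenize the pre-processed corpus into 2-grams (change min/max for 3- and 4-grams)
    bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))

    # Overall 2-gram frequencies, sorted from most to least frequent
    freq2 <- sort(slam::row_sums(tdm2), decreasing = TRUE)
    head(freq2, 20)   # the top 20 shown in Chart 2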

[Chart 1 through Chart 4: top 20 most frequent single words, 2-grams, 3-grams, and 4-grams]

Note that words that are traditionally considered “stop” words (“and”, “but”, “or”, “the”, etc.) were not removed from the corpus. If the project goal were to categorize the contents of the text data, these words would have been removed as unimportant to the semantic content of the text. However, these words are critical for the prediction model because a user may enter one of them as the next word in sequence. Therefore, they need to show up as possible predictions when appropriate. This is clearly evident in the n-gram charts, where these words appear very often in the most frequently used sequences.

Interestingly, the histograms show that the frequency counts of the n-grams drop by almost a full order of magnitude between each size: from a maximum of around 30,000 for the most frequently used single word, to around 3,000 for 2-grams, around 300 for 3-grams, and around 60 for 4-grams.

Table 3: Memory Usage (bytes)
Frequency List   Memory Used
Single Words       2,820,344
2-grams           23,053,776
3-grams           44,100,144
4-grams           54,689,704

Table 3 indicates that a large amount of memory is consumed by the n-gram frequency tables - and this for just a 1% sample of the full text data set.
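
For reference, measurements of this kind can be obtained with object.size(); the frequency-table names below (freq1 through freq4) are assumptions carried over from the earlier sketch.

    # Approximate memory used by each frequency table
    format(object.size(freq1), units = "b")   # single words
    format(object.size(freq2), units = "b")   # 2-grams
    format(object.size(freq3), units = "b")   # 3-grams
    format(object.size(freq4), units = "b")   # 4-grams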

Table 4: Top N-gram Usage (percent)
N-gram Size    Total Count   25% Threshold   50% Threshold   75% Threshold
Single Words        43,768             0.1             0.6             4.4
2-grams            332,736             0.7             8.9            49.0
3-grams            584,289            12.9            41.9            71.0
4-grams            661,831            23.1            48.7            74.4

The histograms in Chart 1 through Chart 4 indicate a severe skewing of the frequency distribution toward low-frequency n-grams. Table 4 shows this in another way, as the percentage of distinct n-grams (ordered from most to least frequently occurring, for each n-gram length) needed to account for a given share of overall n-gram usage. For example, the most frequent 0.6% of distinct words account for 50% of all word occurrences in the corpus. Similarly, the most frequent 8.9% of 2-grams account for 50% of all 2-gram occurrences. However, it takes 41.9% and 48.7% of the 3-grams and 4-grams, respectively, to reach the same 50% usage threshold.
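
These thresholds can be computed with a short helper like the one below; this is a sketch assuming the sorted frequency vectors from the earlier code (e.g., freq2), not the exact code used for the table.

    # Percentage of distinct n-grams (most to least frequent) needed to cover
    # a given share of total n-gram usage
    coverage <- function(freq, threshold = 0.50) {
      cumShare <- cumsum(as.numeric(freq)) / sum(as.numeric(freq))
      100 * which(cumShare >= threshold)[1] / length(freq)
    }
    coverage(freq2, 0.50)   # roughly 8.9 for 2-grams, per Table 4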

Table 5 shows what is implied by Table 4: that a very high percentage of the n-grams occur only once in the sample data.

Table 5: Single Occurrence N-grams (percent)
N-gram Size    N-grams Occurring Only Once
Single Words                          55.7
2-grams                               81.1
3-grams                               93.4
4-grams                               98.4

Conclusion

Given the degree of change in frequency counts between n-gram sizes, it appears that the shorter n-grams are likely to play the most significant role in the prediction algorithm. While the longer three- and four-word sequences might yield high prediction accuracy when they do match, they occur so infrequently in user-entered text that they could lead to a high number of prediction failures if they were the sole basis for the model.
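
One way to combine the n-gram sizes is a simple back-off lookup, sketched below. This is only an illustration of the idea, with an assumed ngramTables structure (a list, indexed by n-gram size, of data frames with prefix, nextWord, and count columns), not a committed design.

    # Try the longest matching n-gram first, then fall back to shorter ones
    predictNext <- function(inputWords, ngramTables, maxN = 4) {
      for (n in maxN:2) {
        if (length(inputWords) >= n - 1) {
          prefix <- paste(tail(inputWords, n - 1), collapse = " ")
          hits <- ngramTables[[n]][ngramTables[[n]]$prefix == prefix, ]
          if (nrow(hits) > 0) return(hits$nextWord[which.max(hits$count)])
        }
      }
      # No match at any length: fall back to the most frequent single word
      ngramTables[[1]]$nextWord[1]
    }

    # Example: predictNext(c("thanks", "for", "the"), ngramTables)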

Memory usage will be an important issue in the design of the prediction model. Even if memory usage scales only linearly with sample size (a highly unlikely situation), the amount of memory needed to store frequency tables covering the entire text data set would likely make them impractical to build, let alone use in a prediction model, without substantial computing and memory capacity. Fortunately, some paring of the single-word and 2-gram frequency tables appears possible to scale back the memory requirement of the prediction model. Paring of the longer n-grams will require some experimentation to find the “sweet spot” in the tradeoff between model accuracy and memory footprint.
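
As a first paring step, the single-occurrence n-grams highlighted in Table 5 could simply be dropped; the sketch below assumes the 4-gram frequency vector freq4 from the earlier code.

    # Drop 4-grams that occur only once
    freq4pruned <- freq4[freq4 > 1]
    length(freq4pruned) / length(freq4)   # fraction retained (~1.6%, per Table 5)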