Capstone Exploratory Data Analysis Report

This is a preliminary report on the text prediction capstone project for the Johns Hopkins Data Science certification delivered by coursera.org. The goal is to construct a language model from a set of data drawn from blogs, online news articles, and Twitter posts so that it can be used in a text prediction algorithm. This report demonstrates that the data have been analyzed and that some preliminary considerations have been made toward the goal of producing the language model and text prediction application.

I am treating three data files as the corpus for this project: a file of blog posts, one of Twitter posts, and one of news articles. Four sets of such files were provided, one each in English, Russian, Finnish, and German. Per Dr. Peng’s instructions, this project focuses on the English-language files, and that is what I have done here. The code to produce this document is mostly hidden, as the instructions indicate the audience should be a non-technical manager.

Text Data Size

Below are the line, word and character counts for the three files:

  lines     words      chars  filename
 899288  37334690  210160014  final/en_US/en_US.blogs.txt
2360148  30374206  167105338  final/en_US/en_US.twitter.txt
1010242  34372720  205811889  final/en_US/en_US.news.txt
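
For what it is worth, the counts above can be reproduced approximately in R with something like the following sketch; the whitespace-based word split and the exclusion of newline characters from the character count mean the numbers will not match a byte-exact count.

files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.twitter.txt",
           "final/en_US/en_US.news.txt")
for (f in files) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  cat(length(lines),                              # line count
      sum(lengths(strsplit(lines, "\\s+"))),      # word count (rough)
      sum(nchar(lines, type = "bytes")),          # character count, newlines excluded
      f, "\n")
}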

The files were read in and tokenized using the tokenizers package. From the tokens, unigram and bigram frequencies were computed; that is, occurrences of single words and pairs of words were counted and recorded. Trigram (three-word sequence) frequencies will also be computed, but the tokenization process I have been using may need to be optimized to store trigram occurrences more efficiently. More specifically, I will probably need a batched approach instead of invoking the tokenization functions on the entire data set at once, along the lines of the sketch below.
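
The following is a rough sketch of what such a batched count might look like for bigrams; the file path, batch size, and the use of an environment as a hashmap are illustrative assumptions rather than the exact pipeline that produced the numbers in this report.

library(tokenizers)

# Count bigrams in batches so the whole corpus never has to be tokenized in one call.
count_bigrams_batched <- function(path, batch_size = 50000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- new.env(hash = TRUE)   # environment used as a hashmap: bigram -> count
  repeat {
    lines <- readLines(con, n = batch_size, encoding = "UTF-8", skipNul = TRUE)
    if (length(lines) == 0) break
    batch_tab <- table(unlist(tokenize_ngrams(lines, n = 2, n_min = 2)))
    for (bg in names(batch_tab)) {
      prev <- if (exists(bg, envir = counts, inherits = FALSE)) get(bg, envir = counts) else 0L
      assign(bg, prev + as.integer(batch_tab[[bg]]), envir = counts)
    }
  }
  counts
}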

A note on data structures

One of the activities that took more time than expected was exploring different approaches to storing word and n-gram frequencies, as well as pre-computed predictions. Specifically, I tested data frames, hashmaps, and environments used as hashmaps, and I also researched tries, which seem especially attractive for this purpose. However, I did not find a flexible enough trie implementation in R (one where attributes can be set on words and exact key matches can be queried). My current thinking is to use a character vector for index_to_word, a hashmap for word_to_index, and matrices mapping index_of_context_words -> index_of_prediction, roughly as sketched below. This is still very preliminary, though.
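
Here is a minimal sketch of those structures using a toy vocabulary; the variable names follow the description above, but the contents, the single-word context, and the lookup at the end are illustrative only.

index_to_word <- c("the", "to", "and", "in", "i")    # integer id -> word
word_to_index <- new.env(hash = TRUE)                # word -> integer id (hashmap)
for (i in seq_along(index_to_word)) word_to_index[[index_to_word[i]]] <- i

# One row per observed context (here a single preceding word) and one column for
# the predicted word's id; a real model would add columns for longer contexts
# and for counts or scores.
context_to_prediction <- matrix(
  c(4L, 1L,    # "in" -> "the"
    2L, 1L),   # "to" -> "the"
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("context_id", "prediction_id")))

# Lookup example: predict the word most likely to follow "in"
ctx <- word_to_index[["in"]]
index_to_word[context_to_prediction[context_to_prediction[, "context_id"] == ctx,
                                    "prediction_id"]]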

Word Frequency Distribution in the Data

Below are some stats on the unigram (single word) frequency distribution in the corpus (the text data). A small fraction of the words accounts for the majority of all encountered occurrences. After passing the tokens through (1) a profanity filter and (2) a “word” filter that keeps only alphabetic characters and punctuation, 628,394 distinct “words” were encountered. These are not all valid English words, however, as other scripts were observed in the data. A sketch of the filtering step appears after the output below.

summary(word_freq)
##      word             frequency      
##  Length:628394      Min.   :      1  
##  Class :character   1st Qu.:      1  
##  Mode  :character   Median :      1  
##                     Mean   :    152  
##                     3rd Qu.:      4  
##                     Max.   :4771927
head(word_freq,5)
##   word frequency
## 1  the   4771927
## 2   to   2764230
## 3  and   2422450
## 4   in   1657973
## 5    i   1657335
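
The filtering step mentioned above looks roughly like the following; the profanity list shown is only a placeholder, the regular expression is just one way to keep alphabetic characters and common punctuation, and tokens is assumed to be the character vector of unigram tokens from the tokenization step.

profanity <- c("badword1", "badword2")      # placeholder list, not the one actually used
is_clean  <- !(tokens %in% profanity)
is_wordy  <- grepl("^[[:alpha:]'.-]+$", tokens)
tokens    <- tokens[is_clean & is_wordy]

# Tabulate word frequencies and sort from most to least frequent
word_freq <- as.data.frame(table(word = tokens), stringsAsFactors = FALSE)
names(word_freq)[2] <- "frequency"
word_freq <- word_freq[order(-word_freq$frequency), ]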

The following plot (y-axis on log scale) shows the frequency of each word, with the words sorted from most frequent to least frequent. The x-axis values are simply the integer IDs of the words, where 1 is the most frequent word and 628,394 is the least frequent word. The y-axis represents the number of occurrences of each word.
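
A plot along these lines can be drawn with something like the following sketch, assuming word_freq is sorted by descending frequency; the actual figure may have been produced differently.

word_freq$rank <- seq_len(nrow(word_freq))
plot(word_freq$rank, word_freq$frequency, type = "l", log = "y",
     xlab = "word rank (1 = most frequent)",
     ylab = "occurrences (log scale)")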

The following plot shows the cumulative proportion of all encountered words, from most frequent to least. That is to say, at position x = 1 the plot shows what proportion of all encountered word occurrences is represented by the most frequent word; at x = 2, the proportion covered by the two most frequent words; then the three most frequent, and so on. The two horizontal lines are drawn at 0.5 (50%) and 0.9 (90%).
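
A sketch of how that curve can be computed and drawn; the resulting vector x is the same cumulative-proportion vector used in the threshold code below, again assuming word_freq is sorted by descending frequency.

x <- cumsum(word_freq$frequency) / sum(word_freq$frequency)
plot(x, type = "l",
     xlab = "number of most frequent words",
     ylab = "cumulative proportion of word occurrences")
abline(h = c(0.5, 0.9), lty = 2)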

More precisely, the number of most frequent words comprising 50%, 90%, 95% and 99% of the corpus is calculated directly. The code is shown here in case the reviewer knows a better method and can enlighten me; what I came up with seems a bit more effortful than should be required.

# Same question answered numerically: x is the cumulative-proportion vector from the
# plot above. Keep only the values at or above the threshold and see how many values
# were dropped by that operation -- that will be k - 1, where k is the number of most
# frequent words comprising the threshold fraction of the corpus. There is probably a
# more direct way to do this -- I would like to know it.
idx50 <- length(x) - length(x[x >= 0.5]) + 1
idx90 <- length(x) - length(x[x >= 0.9]) + 1
idx95 <- length(x) - length(x[x >= 0.95]) + 1
idx99 <- length(x) - length(x[x >= 0.99]) + 1
wordcounts <- c(idx50, idx90, idx95, idx99)
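
Since the comment above explicitly asks for a better method, one possibly more direct equivalent is sketched here: which() gives the positions of all cumulative values at or above a threshold, and the first of those positions is the word count we want.

# Index of the first cumulative value reaching each threshold; should equal wordcounts above.
wordcounts_alt <- sapply(c(0.5, 0.9, 0.95, 0.99), function(p) which(x >= p)[1])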

Percent of Corpus Comprised of Top N Words

top_n_words  percent_of_corpus
        171                50%
       7860                90%
      18496                95%
     106315                99%

Bigram Frequency Distribution in the Data

Below are some stats on the bigram (word pair) frequency distribution. While around 628,000 distinct words were encountered, more than 15,000,000 distinct bigrams were encountered.

summary(n2gram_freq)
##       w1                 w2              frequency       
##  Length:15305828    Length:15305828    Min.   :     1.0  
##  Class :character   Class :character   1st Qu.:     1.0  
##  Mode  :character   Mode  :character   Median :     1.0  
##                                        Mean   :     6.1  
##                                        3rd Qu.:     2.0  
##                                        Max.   :408712.0
head(n2gram_freq,5)
##    w1  w2 frequency
## 1  in the    408712
## 2  to the    213968
## 3 for the    201253
## 4  on the    197500
## 5  to  be    162757

The following plot (y-axis on log scale) shows the frequency of each bigram, with the bigrams sorted from most frequent to least frequent. While the most frequent word in the corpus was found more than 4,000,000 times, the most frequent bigram was found about 400,000 times, a difference of an order of magnitude. Still, the shape of the curve is similar to the unigram plot.

The following plot shows the cumulative proportion of all encountered bigrams, from most frequent to least. Whereas 171 words comprise 50% of all word occurrences in the corpus, it takes nearly 57,000 bigrams to account for half of all the bigram occurrences.

Again, the number of most frequent bigrams comprising 50%, 90%, 95% and 99% of all bigram occurrences is calculated directly.

Percent of Corpus Comprised of Top N Bigrams

top_n_bigrams  percent_of_corpus
        56881                50%
      6001743                90%
     10653786                95%
     14375420                99%

Conclusion

I need to do more work on efficiently handling and loading trigram data; even now, the bigram operations take quite a while. The ideas I plan to implement going forward are based primarily on two sources. The first is Arnaldo Pedro Figueira Figueira’s Coursera videos, which can be found on YouTube; in them he describes bigram and trigram models, evaluating a model using perplexity, and dealing with the limitations of the data set by using discounting. The second is Andrew Ng’s deeplearning.ai curriculum, in which he introduces the idea of word embeddings. I am curious to see whether, with reasonable computational effort, word embeddings can be used to increase the accuracy of the prediction model when the input text has very few examples in the corpus (through similarity mapping). Thanks for reading, and I’m looking forward to your feedback. Good luck with your project!