Introduction

The Data Science Specialization Capstone Project involves creating a text prediction application based on the textual database supplied by Coursera. This report describes the tasks completed at the end of the second week of the Capstone course:

  1. Use of the tm package to create corpora from the text files.
  2. Pre-processing the text in the corpora to remove punctuation, profanities & “nonsense” words.
  3. Tokenization of the cleaned up text into unigrams, bigrams and trigrams.
  4. Design of the basic method to implement text prediction.

Analysis of source data

Generating corpora from the text files

The English language database supplied by Coursera consists of text files generated from three different sources - news feeds, blogs and Twitter feeds. The files are quite large, as is evident from the table below:

               File Size (bytes)     Lines      Words  Characters
    en_US.blogs.txt  210,161,664   899,288 37,272,578 210,160,014
     en_US.news.txt  205,815,808 1,010,242 34,309,642 205,811,889
  en_US.twitter.txt  167,108,608 2,360,148 30,341,028 167,105,338

In order to utilize the text mining capabilities offered by the tm package, one needs to load the contents of the text files into a specific kind of data structure called a “Corpus”. The tm package provides two kinds of corpus structures - virtual (VCorpus - the contents are held in memory) and permanent (PCorpus - the contents are written to disk). Given the large size of the text files, the PCorpus structure seemed to offer a way to handle the entire dataset without concerns about memory use. However, it soon became evident that this was not the case: the data being processed still needed to reside in memory, as the PCorpus structure only offers a way to store the corpus data for later retrieval. Furthermore, reading the corpus back from disk did not reproduce the same structure that was written out. It was therefore necessary to read the data being analyzed into memory using the VCorpus data structure.
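
As a minimal sketch, loading one of the source files into a virtual corpus looks roughly as follows (the file path and encoding here are assumptions for illustration):

library(tm)

# Read one source file and wrap it in an in-memory (virtual) corpus.
blogs_text <- readLines("final/en_US/en_US.blogs.txt",
                        encoding = "UTF-8", skipNul = TRUE)
blogs_corpus <- VCorpus(VectorSource(blogs_text))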

Split -> Cleanup & Analyze -> Merge

Given the large size of the text source files, it was necessary to process the data in three steps as illustrated in the figure below:

 Figure: Processing text data to generate ngram data

Split

Each source file was split into smaller files using the function bigreadr::split_file. After some trial runs it was determined that files larger than ~10 MB caused memory issues with the JVM used by RWeka; increasing the Java heap space by changing the java.parameters setting did not seem to help. The number of split files was therefore set to 20. Each split file was then processed through the cleanup and analysis steps described below.
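
A sketch of the splitting step is shown below; the file paths are assumptions, and the per-part line count is simply chosen to yield roughly 20 parts:

library(bigreadr)

# Split one source file into roughly 20 equally sized parts.
n_lines <- nlines("final/en_US/en_US.blogs.txt")
split_info <- split_file("final/en_US/en_US.blogs.txt",
                         every_nlines = ceiling(n_lines / 20),
                         path_dir_output = "splits/blogs",
                         prefix_out = "blogs_part_")
split_files <- get_split_files(split_info)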

Cleanup & Analyze

Each smaller text file was first read into a virtual corpus object. The text in each virtual corpus was then processed through a set of transformations using the tm_map function (a sketch of these transformations follows the list):

  • convert all characters to lowercase
  • remove all non-alphabet characters except the single quotes (’) - this would retain word contractions such as “i’ll”, “we’re” etc.
  • remove all single letters and pairs of identical letters occurring in isolation
  • remove leading and trailing single-quotes and pairs of single quotes
  • remove all extra whitespace
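
A sketch of this cleanup sequence is shown below, assuming a corpus object named split_corpus; the regular expressions are illustrative and may differ from the ones actually used in the project:

library(tm)

# Custom transformation: apply a regex substitution to every document.
sub_pattern <- content_transformer(
  function(x, pattern, replacement = " ") gsub(pattern, replacement, x, perl = TRUE))

split_corpus <- tm_map(split_corpus, content_transformer(tolower))
# keep only letters and single quotes (preserves contractions such as "i'll")
split_corpus <- tm_map(split_corpus, sub_pattern, "[^a-z']")
# drop single letters and doubled identical letters standing alone
split_corpus <- tm_map(split_corpus, sub_pattern, "(?<![a-z'])([a-z])\\1?(?![a-z'])")
# strip single quotes that are not attached to a letter on both sides
split_corpus <- tm_map(split_corpus, sub_pattern, "(?<![a-z])'{1,2}|'{1,2}(?![a-z])")
split_corpus <- tm_map(split_corpus, stripWhitespace)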

Next, the function tm::TermDocumentMatrix was used in conjunction with RWeka::NGramTokenizer to get a complete list of words from the corpus. This list was used in two ways:

  • identify words where the same letter was repeated three or more times in a row
  • identify words that were tagged as profanities or ones that contained profanities as a substring

The tm_map function was then used to remove all these words from the corpus.

Once the corpus text had been cleaned up, tm::TermDocumentMatrix and RWeka::NGramTokenizer were once again used to generate Term Document Matrices for unigrams, bigrams and trigrams. Each Term Document Matrix was written out to a CSV file. Processing each text source file (e.g. “en_US.blogs.txt”) thus generates a set of 20 files each for unigrams, bigrams & trigrams. The files contain two columns: the ngrams found in each split text file and the number of occurrences of each ngram.
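
A sketch of this step for one split corpus is shown below; the object and output file names are illustrative, and the singleton filter for bigrams and trigrams is discussed in the note that follows:

library(tm)
library(RWeka)

# Tokenizer factory: build an n-gram tokenizer for a given n.
ngram_tokenizer <- function(n) {
  force(n)
  function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
}

for (n in 1:3) {
  tdm <- TermDocumentMatrix(split_corpus,
                            control = list(tokenize = ngram_tokenizer(n)))
  counts <- data.frame(ngram = Terms(tdm),
                       count = slam::row_sums(tdm))
  # keep singleton bigrams/trigrams out to limit file size
  if (n > 1) counts <- counts[counts$count > 1, ]
  write.csv(counts, sprintf("ngrams/blogs_part01_%dgram.csv", n),
            row.names = FALSE)
}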

Note: The list of unigrams included in the output is not filtered by the number of occurrences - i.e. if a word appears only once in a split text file it is still included in the split unigram file. For bigrams and trigrams, rows with a count of 1 were filtered out, as the resulting datasets were otherwise too large. This does lead to a scenario where the split artificially excludes certain bigrams and trigrams, since an ngram that occurs only once in one split file may also have been present in other parts of the source file.

Merge

Once the ngram data for each split text file is gathered, this data is stitched together using the R base merge function. As illustrated in the figure below for bigrams, this process consists of two steps:

  • The 20 split files for each data source (blogs, news & twitter) are first merged into a single unigram/bigram/trigram file for that source.
  • The unigram/bigram/trigram files for the three sources are merged together to give three files for the entire set of input text data.

 Figure: Merging ngram data from split files and sources
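
One way this merge might look for the bigram files of a single source is sketched below; it assumes each split CSV has columns ngram and count, and sums the counts after each merge (file names are illustrative):

# Merge the per-split bigram counts for one source into a single table.
split_csvs <- list.files("ngrams", pattern = "^blogs_part.*_2gram\\.csv$",
                         full.names = TRUE)
merged <- read.csv(split_csvs[1], stringsAsFactors = FALSE)
for (f in split_csvs[-1]) {
  part <- read.csv(f, stringsAsFactors = FALSE)
  merged <- merge(merged, part, by = "ngram", all = TRUE)
  # an ngram missing from one part shows up as NA, so sum with na.rm = TRUE
  count_cols <- setdiff(names(merged), "ngram")
  merged$count <- rowSums(merged[count_cols], na.rm = TRUE)
  merged <- merged[, c("ngram", "count")]
}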

Ngram statistics and object sizes

The figure below shows the distribution of ngrams by the number of occurrences. Note that the x-axis shows the log10 of the number of occurrences; otherwise the graphs basically look like one tall column at a value of 0 (for unigrams) or 1 (for bi/trigrams).

The sizes of the three ngram datasets, once they are loaded into data frames, are listed below:

    ngram Size (bytes)
  Unigram   47,537,216
   Bigram  129,976,656
  Trigram  175,036,320

The total size of all three datasets is around 350 MB, which is quite large. As can be seen from the histograms above, the vast majority of bigram and trigram phrases have only 2 occurrences. If one filters out these records, the sizes of these datasets reduce by more than 50%.
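
For example, dropping those rows from the merged bigram table is a one-line filter (assuming a data frame named bigrams with a count column):

# Remove bigrams that occur only twice to shrink the in-memory table.
bigrams_small <- bigrams[bigrams$count > 2, ]
format(object.size(bigrams_small), units = "MB")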

Initial thoughts on algorithm for text prediction

The specific ask in this project is for:

A Shiny app that takes as input a phrase (multiple words), one clicks submit, and it predicts the next word.

Given this, the unigrams data set will not be needed since we have not been asked to predict the word as it is being typed.

There are three separate cases that need to be considered:

  1. The last word that was typed is found as the start of a bigram in the bigrams dataset.
  2. The last two words typed are found as the start of a trigram in the trigrams dataset.
  3. The last word typed is not found as the start of a bigram.

Match found in bi/trigram datasets

In the first two cases we can use a grep search to suggest the next word, based on the bigram or trigram whose starting words match the last word or the last two words typed.

We can try this approach out using a simple “findNgram” function that I have implemented, which takes one or two words as input and returns the most probable next word based on the number of occurrences of bigrams or trigrams whose starting words match the input.

findNgram("hello")
Time difference of 0.6162181 secs
[1] "to"
findNgram("hello and")
Time difference of 0.5237479 secs
[1] "welcome"

As is evident, the total time to search the bigram and trigram datasets is under a second, which seems reasonable given the specific ask of the project.
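
The findNgram function itself is not reproduced in this report, but one possible shape for such a lookup is sketched below; it assumes merged data frames named bigrams and trigrams, each with columns ngram and count, and may differ from the actual implementation:

findNgram_sketch <- function(phrase) {
  t0 <- Sys.time()
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  # use the trigram table when two or more words are supplied, else bigrams
  if (length(words) >= 2) {
    ngrams <- trigrams
    prefix <- paste(tail(words, 2), collapse = " ")
  } else {
    ngrams <- bigrams
    prefix <- words
  }
  hits <- ngrams[grep(paste0("^", prefix, " "), ngrams$ngram), ]
  print(Sys.time() - t0)
  if (nrow(hits) == 0) return(NA_character_)
  # the suggestion is the last word of the most frequent matching ngram
  best <- hits$ngram[which.max(hits$count)]
  tail(strsplit(best, " ")[[1]], 1)
}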

Match not found in bi/trigram datasets

In the case where we do not find a match in the bigram or trigram datasets, we need to switch to a different algorithm. The most suitable approach would be a k-nearest neighbor classification of the corpora to get the next likely word based on a distance metric. I have not yet had the chance to carry out a detailed analysis of the k-nearest neighbor approach to judge its feasibility in terms of memory use and speed.

The k-nearest neighbor approach is possible only if the word that is typed in exists in the unigrams dataset to begin with. If the typed word does not match any of the words in the sample text data supplied for this project, it may be necessary to implement partial grep searches on the bigram and trigram datasets to come up with suggestions. This is another area that needs more work.
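
One rough illustration of such a partial-match search on the bigram dataset is sketched below (using the same assumed bigrams data frame); this is only a direction to explore, not a tested solution:

# Suggest completions when the typed word has no exact bigram match, by
# looking for bigrams whose first word merely starts with the typed text.
suggestPartial <- function(word, n = 3) {
  hits <- bigrams[grep(paste0("^", tolower(word)), bigrams$ngram), ]
  if (nrow(hits) == 0) return(character(0))
  hits <- hits[order(-hits$count), ]
  head(sub("^\\S+ ", "", hits$ngram), n)
}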