The Data Science Specialization Capstone Project involves creating a text prediction application based on the text database supplied by Coursera. This report describes the tasks completed at the end of the second week of the Capstone course.
The English language database supplied by Coursera consists of text files generated from three different sources: news feeds, blogs and Twitter feeds. The files are quite large, as is evident from the table below:
| File | Size (bytes) | Line count | Word count | Character count |
|------|-------------:|-----------:|-----------:|----------------:|
| en_US.blogs.txt | 210,161,664 | 899,288 | 37,272,578 | 210,160,014 |
| en_US.news.txt | 205,815,808 | 1,010,242 | 34,309,642 | 205,811,889 |
| en_US.twitter.txt | 167,108,608 | 2,360,148 | 30,341,028 | 167,105,338 |
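For reference, statistics of this kind can be gathered with a short R sketch along the following lines (the data/ path is an assumption, and the figures above may have been produced differently, e.g. with the Unix wc utility):

```r
library(stringi)

files <- file.path("data", c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))

stats <- t(sapply(files, function(f) {
  txt <- readLines(f, skipNul = TRUE)          # read the raw lines (large files!)
  c(SizeBytes = file.size(f),                  # size on disk
    LineCount = length(txt),                   # number of lines
    WordCount = sum(stri_count_words(txt)),    # total words
    CharCount = sum(nchar(txt)))               # total characters
}))
stats
```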
In order to utilize the text mining capabilities offered by the tm package, one needs to load the contents of the text files into a specific kind of data structure called a “Corpus”. The tm package provides two kinds of corpus structures: virtual (VCorpus, whose contents are held in memory) and permanent (PCorpus, whose contents are written to disk). Given the large size of the text files, the PCorpus structure seemed to offer a way to handle the entire dataset without raising concerns about memory use. However, it soon became evident that this was not the case: the data being processed still needs to reside in memory, as the PCorpus structure only offers a way to store the corpus data for later retrieval. Furthermore, reading the corpus back from disk did not return the same structure that was written to disk. It was therefore necessary to read the data being analyzed into memory using the VCorpus data structure.
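The loading step itself is straightforward; a minimal sketch is shown below (the file name is illustrative - the split files are produced in the splitting step described below):

```r
library(tm)

# Read one split text file and wrap it in an in-memory (virtual) corpus
txt    <- readLines("data/en_US.blogs_split_01.txt", skipNul = TRUE)
corpus <- VCorpus(VectorSource(txt))
```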
Given the large size of the text source files, it was necessary to process the data in three steps as illustrated in the figure below:
Processing text data to generate ngram data
Each source file was split into smaller files using the function bigreadr::split_file. After some trial runs it was determined that files larger than ~10 MB caused memory issues with the JVM; increasing the Java heap space by changing the java.parameters setting did not seem to help. Therefore the number of split files per source was set to 20. Each split file was then processed through the cleanup and analysis steps described below.
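A sketch of the splitting step is shown below; the prefix_out path is an assumption, and the number of lines per split file will vary by source:

```r
library(bigreadr)

src      <- "data/en_US.blogs.txt"
per_file <- ceiling(nlines(src) / 20)        # lines per piece, aiming for ~20 split files

split_info  <- split_file(src, every_nlines = per_file,
                          prefix_out = "data/en_US.blogs_split")
split_files <- get_split_files(split_info)   # character vector of the split file paths
```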
Each smaller text file was first read into a virtual corpus object. The text in each virtual corpus was then processed through a set of cleanup transformations using the tm_map function, as sketched below.
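A minimal sketch of this step, assuming the standard tm cleanups (lower-casing and removal of punctuation, numbers and extra whitespace); the exact set of transformations applied may have differed:

```r
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case all text
corpus <- tm_map(corpus, removePunctuation)             # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                 # strip digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated whitespace
```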
Next, the function tm::TermDocumentMatrix was used in conjunction with RWeka::NGramTokenizer to get a complete list of the words in the corpus. This list was used in two ways.
The tm_map function was then used to remove all these words from the corpus.
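The removal step itself is a single tm_map call; in the sketch below, words_to_remove is a hypothetical character vector holding the words identified above:

```r
# words_to_remove: character vector of words flagged for removal (hypothetical name)
corpus <- tm_map(corpus, removeWords, words_to_remove)
```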
Once the corpus text had been cleaned up, the function tm::TermDocumentMatrix, in conjunction with RWeka::NGramTokenizer, was once again used to generate term-document matrices for unigrams, bigrams and trigrams. Each term-document matrix was written out to a CSV file. Processing each text source file (e.g. “en_US.blogs.txt”) therefore generates 20 files each for unigrams, bigrams and trigrams. The files contain two columns: the ngrams found in each split text file and the number of occurrences of each ngram.
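A sketch of the bigram case is shown below (the output file name is illustrative); the unigram and trigram cases differ only in the min/max settings passed to the tokenizer:

```r
library(tm)
library(RWeka)

# Tokenizer that produces bigrams (min = max = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Sum the counts across all documents in the split file and write them out
bigram_counts <- slam::row_sums(tdm_bigram)
write.csv(data.frame(ngram = names(bigram_counts), count = bigram_counts),
          "en_US.blogs_split_01_bigrams.csv", row.names = FALSE)
```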
Note: The list of unigrams included in the output is not filtered by the number of occurrences - i.e. if a word appears only once in a text file, it is still included in the split unigram file. For bigrams and trigrams, rows with a count of 1 were filtered out, as otherwise the resulting datasets were too large. This does lead to a scenario where the split artificially excludes certain bigrams and trigrams, since an ngram that occurs only once within one split file may also be present in other parts of the source file.
Once the ngram data for each split text file has been gathered, this data is stitched together using the R base merge function. As illustrated in the figure below for bigrams, this process consists of two steps: the ngram data from the split files of each source is merged first, and the results are then merged across the three sources.
Merging ngram data from split files and sources
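A sketch of the first of these steps, for two hypothetical split-file bigram data frames (each with columns ngram and count), is shown below; the same pattern is repeated across the remaining split files and then across the three sources:

```r
merged <- merge(bigrams_split01, bigrams_split02, by = "ngram", all = TRUE)

# merge() leaves two count columns (count.x, count.y); sum them, treating NA as 0
merged$count <- rowSums(merged[, c("count.x", "count.y")], na.rm = TRUE)
merged       <- merged[, c("ngram", "count")]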
The figure below shows the distribution of ngrams by the number of occurrences. The x-axis shows the log10 of the number of occurrences, since otherwise the graphs basically look like one tall column at a value of 0 (for unigrams) or 1 (for bi/trigrams).
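A histogram of this kind can be produced with a few lines of base R, assuming a merged data frame with a count column (the actual plots may have been generated differently):

```r
hist(log10(bigrams$count),
     breaks = 50,
     main   = "Distribution of bigram occurrences",
     xlab   = "log10(number of occurrences)")
```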
The sizes of the three ngram data sets, once they are loaded into data frames, are listed below:
| ngram | Size (bytes) |
|-------|-------------:|
| Unigram | 47,537,216 |
| Bigram | 129,976,656 |
| Trigram | 175,036,320 |
The total size of all three datasets is around 350 MB, which is quite large. As can be seen from the histograms above, the vast majority of bigram and trigram phrases have 2 occurrences. If one filters out these records, the sizes of these datasets reduce by more than 50%.
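The size reduction from dropping these rows can be checked with object.size; a sketch for the bigram data frame (the column and object names are assumptions) is shown below:

```r
format(object.size(bigrams), units = "MB")         # size before filtering

bigrams_small <- bigrams[bigrams$count > 2, ]      # drop rows occurring only twice
format(object.size(bigrams_small), units = "MB")   # size after filtering
```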
The specific ask in this project is for:
A Shiny app that takes as input a phrase (multiple words), one clicks submit, and it predicts the next word.
Given this, the unigrams data set will not be needed since we have not been asked to predict the word as it is being typed.
There are three separate cases that need to be considered: the last two words of the input phrase match the start of a trigram; only the last word matches the start of a bigram; or no match is found in either dataset.
In the first two cases we can use a grep search to suggest the next word, based on the trigram or bigram whose starting words match the last two words or the last word of the input.
We can try this approach out using a simple “findNgram” function that I have implemented, which takes one or two words as input and returns the most probable next word based on the number of occurrences of bigrams or trigrams whose starting words match the input.
findNgram("hello")
Time difference of 0.6162181 secs
[1] "to"
findNgram("hello and")
Time difference of 0.5237479 secs
[1] "welcome"
As is evident, the total time to search the bigram and trigram datasets is under a second, which seems reasonable given the specific ask of the project.
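For reference, a minimal sketch of how such a lookup could be implemented is shown below; it assumes merged bigram and trigram data frames with columns ngram and count, and is not the actual findNgram implementation:

```r
findNgramSketch <- function(phrase, bigrams, trigrams) {
  start <- Sys.time()
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]

  if (length(words) >= 2) {
    # match the last two words against the first two words of each trigram
    prefix  <- paste(tail(words, 2), collapse = " ")
    matches <- trigrams[grepl(paste0("^", prefix, " "), trigrams$ngram), ]
  } else {
    # match the single word against the first word of each bigram
    matches <- bigrams[grepl(paste0("^", words, " "), bigrams$ngram), ]
  }
  print(Sys.time() - start)

  if (nrow(matches) == 0) return(NA_character_)
  best <- matches$ngram[which.max(matches$count)]  # most frequent matching ngram
  tail(strsplit(best, " ")[[1]], 1)                # return its last word
}
```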
In the case where we do not find a match in the bigram or trigram datasets, we need to switch to a different algorithm. The most suitable approach would be a k-nearest neighbor classification of the corpora to get the next likely word based on a distance metric. I have not yet had the chance to carry out a detailed analysis of the k-nearest neighbor approach to judge its feasibility in terms of memory use and speed.
The k-nearest neighbor approach is possible only if the word that is typed in exists in the unigrams dataset to begin with. If the word that is typed in does not match any of the words in the sample text data supplied for this project then it may be necessary to implement partial grep searches on the trigram and bigram datasets to come up with suggestions. This is another area that needs more work.
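As a starting point, such a partial search could look something like the sketch below (the object names and fragment are assumptions; this is an idea for further work, not an implemented feature):

```r
fragment   <- "hel"                                              # partially typed word
candidates <- bigrams[grepl(paste0("^", fragment), bigrams$ngram), ]
head(candidates[order(-candidates$count), "ngram"], 5)           # top 5 suggestions by count
```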