This is the Milestone Report prepared and submitted as part of the Coursera/Johns Hopkins Data Science Specialization Capstone Project. The aim of the Capstone is to demonstrate the use of Natural Language Processing (NLP) by identifying and training a model that can be used for next-word text prediction. The data set used to identify and train the model was obtained from the HC Corpus (www.corpora.heliohost.org) and consisted of three English-language text files (en_US.news.txt, en_US.blogs.txt and en_US.twitter.txt). In addition, for profanity filtering, a further text file of words considered profane (swearWords.txt) was obtained from http://www.bannedwordlist.com. This milestone report presents the results of an initial study of the nature of the data contained in these files and of their suitability as a training data set. This was assessed by loading the data from file, carrying out preliminary data cleaning and filtering, forming n-grams of various lengths and performing an exploratory data analysis. The report also sets out the next steps planned in order to realise the ultimate goal of implementing a predictive text web app. N.B. In order to keep this report readable for a non-Data Scientist audience, the full R code used to perform this analysis has not been echoed in the report.
As instructed in the project rubric, the text data to be explored were downloaded as a zipped file from the Coursera website and unzipped manually into the project directory. As part of the exercise, a list of profanities to be filtered from these text data sets was located by internet search at http://www.bannedwordlist.com/swearWords.txt. This file was downloaded separately and also placed in the data directory.
The initial processing first set up the environment by loading the required libraries, and then read in the three text data files containing the news, blogs and twitter text together with the text file containing the defined profanities. Finally, some basic statistics were tabulated for each file using the {stringi} stri_count_words() function, and memory was then cleaned up.
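As an illustration of this step, the sketch below shows one way the statistics in Table 1 can be produced with {stringi}. It is a minimal sketch rather than the analysis code itself; the helper name `summarise_file` and the exact file paths are assumptions for illustration.

```r
# Minimal sketch (not the echoed analysis code): read one text file and
# tabulate the basic per-file statistics reported in Table 1 using {stringi}.
library(stringi)

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- stri_count_words(lines)
  data.frame(
    File                = basename(path),
    Size.MB             = round(file.size(path) / 1024^2, 4),
    Lines               = length(lines),
    Max.Chars.per.Line  = max(nchar(lines)),
    Mean.Words.per.Line = round(mean(words), 1),
    Max.Words.per.Line  = max(words),
    Total.Words         = sum(words)
  )
}

input_files <- c("en_US.news.txt", "en_US.blogs.txt",
                 "en_US.twitter.txt", "swearWords.txt")
file_stats  <- do.call(rbind, lapply(input_files, summarise_file))
gc()   # tidy up memory, as described above
```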
File | Size (MB) | Lines | Max.Chars.per.Line | Mean.Words.per.Line | Max.Words.per.Line | Total.Words
---|---|---|---|---|---|---
en_US.news.txt | 196.2775 | 1010242 | 11384 | 34.4 | 1796 | 34762395
en_US.blogs.txt | 200.4242 | 899288 | 40833 | 41.8 | 6726 | 37546246
en_US.twitter.txt | 159.3635 | 2360148 | 140 | 12.8 | 47 | 30093409
swearWords.txt | 0.0005 | 77 | 12 | 1.1 | 4 | 85
Table 1: Text input file statistics
Examination of the summary statistics for the input text files (Table 1) reveals the potential magnitude of the problem of exploring the frequencies of, and connections between, words within these data sets. This is because of their large size - in particular the number of words, which exceeds \(30\times10^6\) in each of the three HC Corpus input files. As an aside, it is reassuring that the maximum number of characters per input line found for the twitter data is exactly 140 - the maximum length of a tweet.
As an initial step in the data processing, it was decided to randomly sample the HC Corpus data sets in order to limit the memory and processor load that could be expected from attempting to process the entire data sets. This was done after setting the random seed in order to ensure reproducibility. The same proportion, two percent (specified as a decimal fraction of the total data set size), was randomly sampled from each of the three data sets. These three random samples were then combined into a single (randomly ordered) subset and memory cleaning was performed. Two percent was selected because processing larger samples with the Java-based {tm} and {RWeka} packages proved impractical on the available hardware, owing to the time required and memory constraints.
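A minimal sketch of this sampling step is shown below, assuming the three files have already been read into the character vectors `news`, `blogs` and `twitter` (one line per element); the seed value shown is arbitrary and for illustration only.

```r
# Minimal sketch: take a reproducible 2% random sample of lines from each
# data set and combine them into a single, randomly ordered subset.
set.seed(1234)          # seed value here is arbitrary / illustrative
sample_frac <- 0.02     # proportion of each data set to retain

sample_lines <- function(x, frac) {
  x[sample(length(x), size = floor(frac * length(x)))]
}

subset_all <- c(sample_lines(news,    sample_frac),
                sample_lines(blogs,   sample_frac),
                sample_lines(twitter, sample_frac))
subset_all <- sample(subset_all)   # shuffle the combined subset

rm(news, blogs, twitter); gc()     # free memory, as described above
```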
At this point in the processing it was necessary to make a high-level decision concerning the approach to the predictive model. It was decided that, since the testing will be based on completing given phrases, an n-gram based approach would be the most appropriate; that is, an approach based on connected sequences of words used to predict a specific next word, rather than a bag-of-words approach. A bag-of-words approach might be more appropriate for other types of prediction, e.g. of emotional state or mood. Following this logic, it was also decided that stop-word removal and stemming would not be performed during data cleaning, since these would result in the loss of potentially important information concerning the exact sequencing of words in the n-grams.
Therefore, for the next step, the randomly sampled subset was subjected to sentence detection and splitting using the {qdap} sent_detect() function, which identifies sentences on the basis of the standard punctuation markers (".", "?", "!", etc.). This approach lends itself to n-gram based prediction: if complete sentences can be detected and separated, it prevents the formation of n-grams from otherwise semantically unconnected words spanning the end of one sentence and the start of the next. Since this step is computationally intensive, parallel processing was enabled for it.
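The sketch below illustrates this step, assuming the `subset_all` vector from the previous sketch; the chunking scheme and core count are illustrative choices rather than the settings actually used.

```r
# Minimal sketch: split the sampled lines into individual sentences with
# {qdap} sent_detect(), spreading the work across several cores.
library(qdap)
library(parallel)

n_cores <- max(1, detectCores() - 1)                       # leave one core free
chunks  <- split(subset_all, cut(seq_along(subset_all), n_cores))

# Note: mclapply() forks on Unix-alikes; on Windows mc.cores must be set to 1.
sentences <- unlist(mclapply(chunks, sent_detect, mc.cores = n_cores))
```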
Next, the corpus of individual sentences was built using the {tm} VCorpus() (Volatile Corpus) function. Once this was done, the {tm} tm_map() function was used to (i) convert all text to lower case, (ii) remove numerics (including times and dates), (iii) remove punctuation marks and (iv) strip out any resulting extra white space. This cleaned corpus was then further processed through tm_map() to remove all instances of profane words, as defined in the swearWords.txt file.
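A sketch of these cleaning transformations is shown below, assuming the `sentences` vector from the previous step; the exact ordering of the transformations is illustrative.

```r
# Minimal sketch: build a volatile corpus of the detected sentences and apply
# the cleaning transformations (i)-(iv) plus profanity removal with {tm}.
library(tm)

profanities <- readLines("swearWords.txt")

corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus, content_transformer(tolower))  # (i)   lower case
corpus <- tm_map(corpus, removeNumbers)                 # (ii)  numbers, times & dates
corpus <- tm_map(corpus, removePunctuation)             # (iii) punctuation
corpus <- tm_map(corpus, stripWhitespace)               # (iv)  extra white space
corpus <- tm_map(corpus, removeWords, profanities)      # profanity filtering
```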
Next, the {tm} DocumentTermMatrix() function was used to identify and remove all empty documents from the cleaned corpus (i.e. sentences left empty by the application of the previous steps). The cleaned corpus was then converted back into a data frame suitable for further n-gram processing (using {RWeka}) and summary descriptive statistics were prepared. Finally, memory was cleared.
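One way this step can be implemented is sketched below, assuming the cleaned `corpus` from the previous sketch; the use of {slam} row sums to locate the empty documents is an assumption about the mechanics, not necessarily the exact approach taken.

```r
# Minimal sketch: identify sentences left empty by cleaning via a
# document-term matrix, drop them, and convert the corpus back to a
# data frame ready for n-gram tokenization.
dtm       <- DocumentTermMatrix(corpus)
non_empty <- slam::row_sums(dtm) > 0      # {slam} is a dependency of {tm}
corpus    <- corpus[non_empty]

clean_df <- data.frame(text = sapply(corpus, as.character),
                       stringsAsFactors = FALSE)
rm(dtm); gc()                             # clear memory, as described above
```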
Stage | No.of.Sentences | Max.Chars.per.Sentence | Mean.Words.per.Sentence | Max.Words.per.Sentence | Total.Words
---|---|---|---|---|---
Pre-clean | 31216 | 665 | 12.7 | 123 | 397892
Post-clean | 28767 | 650 | 13.5 | 122 | 387195
Table 2: Random sub-set statistics
Table 2 contains descriptive statistics concerning the make-up of the sentences detected in the randomly sampled data set, both pre- and post-cleaning. It was these post-cleaning sentences that were then used in the synthesis of n-grams. As can be seen in Table 2, even with random sampling at two percent and after data cleaning, approximately \(29\times10^3\) sentences remained prior to tokenization and n-gram computation, containing a total of over \(387\times10^3\) words. This was therefore judged to be a reasonable sample for analysis.
Data frames containing all 1-, 2-, 3- and 4-word n-grams from each individual sentence in the cleaned data frame were then prepared separately, using the appropriately parametrized {RWeka} NGramTokenizer() function.
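A sketch of this tokenization is given below, assuming the `clean_df` data frame from the earlier sketches; the helper name `ngram_freq` is illustrative.

```r
# Minimal sketch: tokenize the cleaned sentences into n-grams of order n with
# {RWeka} NGramTokenizer() and return a frequency-sorted data frame.
library(RWeka)

ngram_freq <- function(text, n) {
  tokens <- NGramTokenizer(text, Weka_control(min = n, max = n))
  freq   <- sort(table(tokens), decreasing = TRUE)
  data.frame(ngram = names(freq), count = as.integer(freq),
             stringsAsFactors = FALSE)
}

unigrams  <- ngram_freq(clean_df$text, 1)
bigrams   <- ngram_freq(clean_df$text, 2)
trigrams  <- ngram_freq(clean_df$text, 3)
quadgrams <- ngram_freq(clean_df$text, 4)
```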
An initial exploratory analysis of the results of the n-gram computation is presented in Figure 1. The figure contains four bar charts, one for each of the 1-, 2-, 3- and 4-gram computations. Each bar chart presents the 40 most prevalent n-grams of the appropriate size within the data set. The x-axis gives the percentage of all n-gram instances accounted for by each of these 40 n-grams (the total number of detected n-gram instances is given in the title of each bar chart), and the n-gram itself appears as the corresponding y-axis label.
Figure 1: Top 1,2,3,4-grams
Of course, the 1-grams presented in Figure 1 are simply the 40 most prevalent words in the randomly sampled data sub-set, the most prevalent of which (at over 5%) is, unsurprisingly, "the" (recall that stop-words were deliberately not removed from this data set). From the title of the 1-gram bar chart it is interesting to note that the total number of unique words detected is over \(32\times10^3\). From the titles of the 2-, 3- and 4-gram bar charts it can be seen that, as expected, the number of unique n-grams rapidly increases as n increases; for the 4-gram results there are over \(380\times10^3\) unique instances. One very important point to note is that none of the n-grams presented in these bar charts (y-axis labels) contains non-alphanumeric or foreign characters. Another is the presence of the 4-gram "thanks for the rt" - this is of course from a tweet (rt = retweet). Although it is not grammatically correct English, it is a legitimate text prediction, and so the decision was made to leave such acronyms untouched.
Figure 2 is a line chart presenting the percentage coverage of the complete set of n-grams (y-axis) as a function of the number of unique n-grams included, taken in order from most to least prevalent, for all four sets of n-grams. The x-axis is expressed as a percentage of the total number of unique n-grams in each case, in order to bring all four sets onto a common scale.
Figure 2: % N-Gram coverage
Examining the results presented in Figure 2 reveals the stark contrast between the 1-gram (i.e. individual word) results and the 2-, 3- and 4-gram results. For individual words, 50 percent coverage is achieved with only 1-2 percent of the unique words, and even 90 percent coverage requires only approximately 20 percent of the unique 1-grams (words). By contrast, for the 2-gram results over 80 percent of the unique 2-grams are required for 90 percent coverage, and for the 3- and 4-gram results close to 90 percent of the unique n-grams are required. This must mean that the majority of the 2-, 3- and 4-grams within the data set occur only once. This may actually be an indicator of good predictive power: if the majority of 3- and 4-grams are unique, then matching a 2- or 3-gram will likely lead to only a few possible predictions for the next word in the sequence. Overall, these results can be seen as encouraging.
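The coverage curves in Figure 2 can be computed from the n-gram frequency tables in the way sketched below, assuming the `unigrams` (etc.) data frames from the previous sketch; the function name `coverage_curve` is illustrative.

```r
# Minimal sketch: cumulative coverage for one set of n-grams, i.e. the share
# of all n-gram instances accounted for by the most frequent unique n-grams.
coverage_curve <- function(freq_df) {
  counts <- sort(freq_df$count, decreasing = TRUE)
  data.frame(pct_unique   = 100 * seq_along(counts) / length(counts),
             pct_coverage = 100 * cumsum(counts) / sum(counts))
}

cov_1 <- coverage_curve(unigrams)
# e.g. the smallest share of unique words needed to reach 50% coverage:
min(cov_1$pct_unique[cov_1$pct_coverage >= 50])
```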
Following on from the results presented here, a number of steps are needed to realise the final goal of producing a Shiny web app capable of text prediction. These can be summarised as follows:
- Attempt to further optimise the percentage of the text database processed, in order to increase the coverage and accuracy of the n-gram based prediction. Given the memory and processing-time constraints imposed by the Java-based {tm} and {RWeka} packages, this will likely involve the use of other packages instead, such as {quanteda} or {stylo}, which are not Java based.
- Work out how best to perform text prediction on the basis of the n-grams produced.
    - In particular, look into a back-off and/or smoothing strategy for attempting prediction from the 4-, 3-, 2- and 1-grams in turn (a minimal sketch of such a back-off look-up is given after this list). Falling back to the 1-grams may seem nonsensical, but once a predicted word has been identified from the 4-, 3- or 2-grams, the 1-grams may be used to determine the overall frequency of the predicted word and hence a measure of its overall likelihood of occurrence.
- Optimise the memory requirements and speed of execution of the final choice of algorithm and n-gram databases to be used in the Shiny web app.
    - In particular, investigate how to optimise the processing on the Shiny server in terms of table look-up of n-grams from a pre-loaded data structure prepared by off-line processing of the corpus. This will likely entail extensive use of {data.table} routines within the web app.
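As a first illustration of how such a back-off look-up might work with {data.table}, a minimal sketch is given below. It assumes the n-gram tables have been converted to data tables with columns `prefix` (the first n-1 words of each n-gram), `nextword` and `count` (and `word`/`count` for the unigrams); these column names, the simple highest-count back-off rule and the function name `predict_next` are all assumptions for illustration, not the final algorithm.

```r
# Minimal sketch of a simple back-off look-up over pre-built n-gram tables.
library(data.table)

predict_next <- function(phrase, dt4, dt3, dt2, dt1) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)  # last 3 words at most
  for (n in seq(length(words), 1)) {                        # 3-, 2-, then 1-word prefix
    prefix_key <- paste(tail(words, n), collapse = " ")
    dt  <- list(dt2, dt3, dt4)[[n]]                         # matching (n+1)-gram table
    hit <- dt[prefix == prefix_key][order(-count)]
    if (nrow(hit) > 0) return(hit$nextword[1])              # otherwise back off
  }
  dt1[order(-count)]$word[1]                                # fall back to top unigram
}
```

A call such as `predict_next("thanks for the", dt4, dt3, dt2, dt1)` would then return the single most likely next word under this scheme.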