The objective of the capstone project is to develop a predictive text data product. This document reports on the progress made at this milestone. It describes what data was acquired and how it was cleaned, presents an exploratory analysis of the cleaned data, and concludes with suggestions for the models that could be used to complete the project. The data is a collection of text documents, called a corpus. The corpus used for this milestone report was made available by HC Corpora through the Coursera website. Other datasets may be used to complete the project, but this report uses only this corpus.
Understanding the characteristics of the acquired data is important, as it clarifies how the data should be cleaned and preprocessed for analysis.
The downloaded documents are zipped text files, grouped into folders by language. The folder of interest here is the US English folder. It contains three text files, with text gathered from three sources: blogs, news and Twitter.
The following table lists, for each file, its size on disk, the size of the corresponding object in the R environment and its number of lines.
| Source | Size on disk | Size of R object | Number of lines |
|---|---|---|---|
| Blogs | 200.42 Mb | 248.49 Mb | 899288 |
| News | 196.28 Mb | 249.63 Mb | 1010242 |
| Twitter | 159.36 Mb | 301.78 Mb | 2360148 |
A prediction model is fitted on a training set and then validated against a separate test set. Here the file sizes and line counts indicate how much of the corpus can practically be sampled. A common rule of thumb is to use roughly 60% of a data set for training and the rest for testing, but this data set is rather large, so a much smaller percentage is needed. In this report a random 1% of the corpus is used as the training sample. The following table shows the size of each randomly sampled object, the word count of the sample and the estimated word count of the corresponding full file (the sample count scaled up by 100).
| Source | Size of sample object | Words in sample | Estimated total words |
|---|---|---|---|
| Blogs | 2.51 Mb | 373603 | 37360300 |
| News | 2.51 Mb | 337083 | 33708300 |
| Twitter | 3.05 Mb | 293887 | 29388700 |
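
The sampling and the scaled-up word estimate can be illustrated with a minimal sketch. It is written in Python rather than the R used for the report, and the function names, the fixed seed and the toy data are assumptions made for illustration only; whether the report sampled by line is also an assumption.

```python
import random

def sample_lines(lines, fraction=0.01, seed=42):
    """Return a random sample of roughly `fraction` of the lines (at least one)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)

def count_words(lines):
    """Count whitespace-separated words across all lines."""
    return sum(len(line.split()) for line in lines)

def estimate_total_words(sample_word_count, fraction=0.01):
    """Scale the sample word count up to the size of the full file."""
    return round(sample_word_count / fraction)

# Toy stand-in for one of the text files (hypothetical data).
lines = [f"this is line number {i}" for i in range(1000)]
sample = sample_lines(lines)
sample_words = count_words(sample)
print(len(sample), sample_words, estimate_total_words(sample_words))
```

With a 1% sample the estimate is simply the sample word count multiplied by 100, which matches the relationship between the last two columns of the table.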
The text of the sample sets is presumed to consist of alphanumeric characters and punctuation. What needs to be explored is which other, less obvious characters are present that do not add to the predictive ability of the model. One example is foreign words, which have little value in an English setting. Another is control characters, which may have been introduced by the way R, or whatever other program collected the data, interpreted and represented it.
To assess which other characters are present, the sample sets are combined into one large sample. R's regular expressions support the POSIX character classes [:punct:] and [:cntrl:], which match punctuation and control characters respectively. These classes are used to extract and list the punctuation and control characters in the sample. The following small table shows the number of unique non-alphanumeric characters found with each class.
| | PunctuationList | ControlList |
|---|---|---|
| NumberOfCharacters | 80 | 34 |
The following table gives examples of the characters found in the combined sample, with their frequencies, that possibly do not add any value to a predictive text model.
| PuncList | PuncFreq |
|---|---|
| ’ | 4880 |
| ‚ | 2 |
| " | 10973 |
| “ | 1528 |
| ” | 1510 |
| « | 11 |
| » | 20 |
| ( | 3788 |
| ) | 4992 |
| [ | 43 |
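
A character inventory of this kind can be approximated as follows. This is a Python sketch, not the R code behind the tables: Python's `re` module has no `[:punct:]` or `[:cntrl:]` classes, so the sketch assumes Unicode general categories as a stand-in (punctuation and symbol categories for punctuation, `Cc` for control characters). The counts it produces will therefore not necessarily match the figures above.

```python
import unicodedata
from collections import Counter

def char_inventory(text):
    """Count punctuation-like and control characters in `text`."""
    punct = Counter()
    cntrl = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cc":               # control characters (tab, bell, ...)
            cntrl[ch] += 1
        elif category[0] in ("P", "S"):    # punctuation and symbols
            punct[ch] += 1
    return punct, cntrl

# Hypothetical sample containing curly quotes, an ellipsis and control characters.
sample = "He said \u201chello\u201d (twice)!\tThen left\u2026\x07"
punct, cntrl = char_inventory(sample)
print(len(punct), len(cntrl))   # number of unique characters in each group
print(punct.most_common(3))     # most frequent punctuation characters
```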
Preprocessing involves loading the sample sets as a sample corpus and then cleaning the text so that it can be analysed. The cleaning method matters, as it has a large bearing on the usefulness of the model.
Cleaning is about deciding what is valuable in the text and discarding what is not. The previous section showed the wide variety of characters present in the text. The first cleaning step is therefore to remove the punctuation and control characters found with those lists, along with website URLs.
Numbers do not add to the predictive value, so they are removed next. What remains should be just text. Because text mining is case sensitive, the next prudent step is to convert all text to lower case. Dropping capital letters loses the distinction of proper nouns, but the name of a person or place does not necessarily add to predictive capability.
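The cleaning steps described so far (URLs, punctuation, control characters, numbers and case) might look roughly like this. It is a simplified Python sketch, not the report's R code: the regular expressions are assumptions, and keeping only the letters a to z also discards accented and other non-English letters.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # crude URL pattern (assumption)
APOSTROPHE_RE = re.compile("['\u2019]")          # straight and curly apostrophes
NON_LETTER_RE = re.compile(r"[^a-z ]")           # anything except a-z and space

def basic_clean(text):
    """Remove URLs, punctuation, control characters and digits; lower-case the rest."""
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = APOSTROPHE_RE.sub("", text)       # "don't" becomes "dont"
    text = NON_LETTER_RE.sub(" ", text)      # everything else becomes a space
    return text

cleaned = basic_clean("Visit http://example.com NOW: 3 items (50% off)!")
print(cleaned.split())
```

Replacing removed characters with a space rather than deleting them keeps neighbouring words from being fused together; the surplus spaces are dealt with when white space is stripped later.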
The next step is interesting, and is motivated by the text mining literature and by Zipf's law. The most frequent words, such as ‘and’, ‘the’ and ‘it’, may not carry much value for prediction. The suggested reason is that they are so ubiquitous that almost anything could follow them. These are called stop words, and removing them is the next step in the process. Profanities are removed after that; because of the way they are used, they generally do not add to predictive capability either. For example, anything could follow ‘What a stupid…’. They therefore have the same usefulness as stop words. The tm package in R has its own built-in set of stop words; instead, the SMART stop word list was used here, together with a separately sourced list of profanities.
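Filtering against word lists can be sketched as below. This is a Python illustration only; the two tiny sets are placeholders invented for the example and stand in for the SMART stop word list and the sourced profanity list, neither of which is reproduced here.

```python
# Placeholder lists (assumptions); the report used the SMART stop word list
# and a separately sourced profanity list.
STOP_WORDS = {"and", "the", "it", "a", "of", "to"}
PROFANITIES = {"darn"}

def remove_words(tokens, *blocklists):
    """Drop every token that appears in any of the given block lists."""
    blocked = set().union(*blocklists)
    return [token for token in tokens if token not in blocked]

tokens = "the cat and the dog chased it to the darn park".split()
kept = remove_words(tokens, STOP_WORDS, PROFANITIES)
print(kept)
```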
The next step is stemming (lemmatization is a related technique). Some inflected words add no more value than the stem, or root, of the word. As an example, consider the phrase ‘firmly holding..’. The inflections signal that the action is in the present tense, but from a predictive point of view the phrase adds little over ‘firm hold’. The inflections are removed and the idea still remains. This is the idea behind stemming: strip off the inflections and keep the stem, with the expectation that the relationship between the words implicitly remains.
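The ‘firmly holding’ example can be reproduced with a deliberately naive suffix stripper. This Python sketch is a toy written for illustration, with an assumed suffix list and minimum stem length; it is not the stemming algorithm used by the tm package and will mis-stem many real words.

```python
SUFFIXES = ("ing", "ly", "ed", "s")   # assumed, far from complete

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(word) for word in "firmly holding".split()]
print(stems)
```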
The final steps are to strip out extra white space, which aids in identifying word boundaries, and then to convert the document into a plain text document, mostly to ensure that it is in the simplest format to use.
Tokenization is defined as taking a string and breaking it up into smaller parts, for example words, phrases or word stems. Tokens are then the building blocks for understanding how text is structured and how tokens relate to each other. The objective is therefore to decide which tokens to use and to understand how, and how frequently, they appear in the text.
Getting a sense of token frequency shows how much of the corpus is represented by the highest-frequency tokens. Since stop words have already been removed, these tokens have value: they indicate the most heavily used words for the given medium. The graph below (Fig. 1) displays the number of tokens against their cumulative frequency. In short, it shows the number of tokens required to represent 50% and 90% of the sample.
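A word-level tokenizer can be as small as the following Python sketch. The pattern is an assumption that treats every run of the letters a to z as one token, which is only reasonable for text that has already been cleaned as described above.

```python
import re

TOKEN_RE = re.compile(r"[a-z]+")   # one token per run of letters (assumption)

def tokenize(text):
    """Split text into lower-case word tokens."""
    return TOKEN_RE.findall(text.lower())

word_tokens = tokenize("Firm hold, firm grip.")
print(word_tokens)
```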
Fig.1 Token or word type percentage representation of sample.
Astoundingly, only 770 distinct tokens are required to represent 50% of the sample. As expected, the number required rises steeply, to 9600 distinct tokens, to represent 90% of the sample. Still, this is astounding, as the sample has 4.36019 × 10^5 (about 436,000) tokens altogether. The number of representative tokens is therefore rather small in proportion to the overall number of tokens.
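The coverage figures come from accumulating token counts in descending order of frequency until the target share of the sample is reached. A minimal Python sketch of that calculation, with a hypothetical function name and toy data, is:

```python
from collections import Counter

def types_needed(tokens, coverage):
    """Number of distinct tokens needed to cover `coverage` of all occurrences."""
    counts = Counter(tokens)
    target = coverage * len(tokens)
    running = 0
    # Walk the word types from most to least frequent, accumulating counts.
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target:
            return n
    return len(counts)

# Toy data: 10 tokens spread over 4 word types.
tokens = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
half, ninety = types_needed(tokens, 0.5), types_needed(tokens, 0.9)
print(half, ninety)
```

In the toy data a single word type covers half of the occurrences, while three are needed for 90%, the same pattern of a short head and a long tail seen in Fig. 1.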
Given this fact, it would be instructive to know what the top words by frequency are. The graphic (Fig.2) represents the top 40 words by frequency count.
Fig.2 Top 40 tokens or word types by frequency count.
Predictive text requires not only the frequency of unigram tokens but also the frequency of relationships between tokens. Here a relationship simply means a combination of contiguous tokens. The following two graphs (Fig.3 and Fig.4) display the top 40 most frequent ‘bigrams’ and ‘trigrams’ (combinations of two and three contiguous tokens). The frequencies of bigrams and trigrams can be used like a map for predictive text: given a word, the most likely bigram that starts with it can be found, which in turn points to the most likely trigram. This all assumes that the bigram and trigram exist in the sample. The non-existence of an n-gram (a combination of n contiguous tokens) is discussed in the conclusion.
Fig.3 Top 40 bigram tokens or phrase types by frequency count.
Fig.4 Top 40 trigram tokens or phrase types by frequency count.
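Counting n-grams and using them as a lookup for the next word can be sketched as follows. This is a Python illustration with hypothetical helper names and a toy sentence, not the code that produced the figures.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all runs of n contiguous tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_table(tokens, n):
    """Map each (n-1)-token history to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        table[gram[:-1]][gram[-1]] += 1
    return table

tokens = "i love new york and i love new shoes and i love new york".split()
bigram_counts = Counter(ngrams(tokens, 2))
trigram_table = next_word_table(tokens, 3)
print(bigram_counts.most_common(2))                    # most frequent bigrams
print(trigram_table[("love", "new")].most_common(1))   # likeliest word after "love new"
```

Keying the table by history rather than by the whole n-gram means a prediction is a single dictionary lookup followed by picking the most frequent follower.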
There are four issues to consider for the final project.
The first issue is the corpus. This report is based on the corpus that was kindly made available through the Coursera website, which was a great point of departure for understanding the basics of text mining. The point to consider is which other corpora should be included to improve the coverage of words. Further, this report used only 1% of the given corpus, so even though the sample was random it may still be too small. Either a bigger sample should be taken, or more samples of equivalent size should be used.
The second issue is the choice of model. This is a little easier, as Coursera suggested that an n-gram model be used, which is why it forms the basis of this report. The real challenge is the implementation of the model and the choice of data structures.
The third issue is unseen n-grams. Word sequences will definitely be given to the application that never occurred as an n-gram in the samples, or even in the corpus, used to create the models. Again guidance was given: backoff models can be used to assign weight to n-grams that the model has not seen. These models belong to the general family of smoothing approaches, which includes the Laplace method; it simply adds one to every count and adjusts the denominator by the number of word types. The backoff model will be used as suggested, since a cursory look through the literature shows that these models seem to work well and are also easy to scale.
The final issue is how to store the data used to predict text. The first constraint is the limited size the server allows for the application, although mobile phones show that great predictive text is still possible. The second is the trade-off between performance and accuracy: the better the performance (in terms of speed), the worse the possible accuracy, and vice versa.
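Both ideas can be illustrated with a small Python sketch. The `predict` function below is a deliberately simplified backoff: it uses the longest history that has been seen and otherwise falls back to a shorter one, without the discounting or weighting that a full backoff model would apply. The Laplace function adds one to each bigram count and the vocabulary size to the denominator. All names and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=3):
    """For each order n, map every (n-1)-token history to its follower counts."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            tables[n][gram[:-1]][gram[-1]] += 1
    return tables

def predict(tables, context, max_n=3):
    """Predict the next word, backing off from trigrams to bigrams to unigrams."""
    context = tuple(context)
    for n in range(max_n, 0, -1):
        history = context[-(n - 1):] if n > 1 else ()
        if len(history) != n - 1:
            continue                      # not enough context for this order
        followers = tables[n].get(history)
        if followers:                     # history seen: use it, else back off
            return followers.most_common(1)[0][0]
    return None

def laplace_bigram_prob(tables, prev, word):
    """Add-one estimate of P(word | prev): (count + 1) / (total + vocabulary size)."""
    vocab_size = len(tables[1][()])
    followers = tables[2].get((prev,), Counter())
    return (followers[word] + 1) / (sum(followers.values()) + vocab_size)

tokens = "i love new york and i love new shoes".split()
tables = build_tables(tokens)
print(predict(tables, ["i", "love"]))        # trigram history is known
print(predict(tables, ["really", "love"]))   # unseen trigram, falls back to bigram
print(laplace_bigram_prob(tables, "new", "moon"))  # unseen bigram gets a non-zero estimate
```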