The goal of the Data Science Capstone Project is to create a predictive text algorithm, using as a starting point a collection of English corpora consisting of blog entries, news articles and Twitter tweets. The resulting algorithm will be deployed as a web app via ShinyApps.
Although the data sets are not very large compared to those commonly used in the field of big data, they are demanding to manipulate on a single standard laptop. The following summary displays the basic characteristics of the raw data sets as originally available in three separate corpora: blogs, news and Twitter. The code sketch after the table shows one way these figures could be computed.
| Corpus | Size | Documents (millions) | Characters (millions) | Average characters per document |
|---|---|---|---|---|
| Blogs | 248.5 Mb | 0.90 | 206.82 | 229.99 |
| News | 249.6 Mb | 1.01 | 203.22 | 201.16 |
| Twitter | 301.4 Mb | 2.36 | 162.10 | 68.68 |
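As a rough illustration, figures like the ones above could be reproduced with base R along the following lines; the directory layout and file names (`final/en_US/en_US.blogs.txt`, etc.) are assumptions about how the raw data is stored locally, not a description of the actual script used.

```r
# Sketch: summarize size, document count and character counts per corpus
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")

corpus_summary <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  c(size_mb        = file.size(f) / 1024^2,
    docs_millions  = length(lines) / 1e6,
    chars_millions = sum(chars) / 1e6,
    avg_chars      = mean(chars))
}))
round(corpus_summary, 2)
```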
To provide an exploratory analysis of the data I created a random sample of the original data, drawing indices from a binomial distribution with probability 0.5 so that roughly 10% of the entries were extracted from each of the original corpora.
```r
# Draw 10% of the corpus length as random indices (binomial, probability 0.5)
sample <- full_size_corpus[rbinom(length(full_size_corpus) * 0.1, length(full_size_corpus), 0.5)]
```
There are several ways to transform text before performing an exploratory analysis, and indeed before creating a predictive algorithm. I mostly used functions from the qdap and tm packages; a minimal sketch of such a cleaning pipeline is shown below.
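The sketch below shows a tm-based cleaning pipeline; the exact transformations and their order used for this report are not reproduced here, and `sample` refers to the sampled corpus created above.

```r
library(tm)

# Minimal cleaning pipeline; qdap replacements (e.g. contractions,
# abbreviations) could be applied to the raw text before this step.
clean_corpus <- function(x) {
  corpus <- VCorpus(VectorSource(x))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

clean_sample <- clean_corpus(sample)
```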
Another powerful tool for inspecting a language is stemming: retrieving the radical of a word so that the various forms derived from the same stem are grouped together. For example, update, updated and updating could all be stemmed to updat. This is a good approach for reducing the size of a corpus and inspecting the frequency of strongly related terms. It does not seem very useful for producing natural language predictions that could be used in real life, as the stems of words (e.g. updat) are not themselves usable words, although it is possible to reverse the stemming, for example by applying to the stem the most common ending that appears in the corpus. However, on a standard laptop this requires exponentially increasing time, as the graph below shows, so I decided not to apply stemming or stem completion.
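For illustration, this is roughly what stemming looks like in R with tm and SnowballC (not applied in the final pipeline, for the reasons above); `clean_sample` carries over from the previous sketch.

```r
library(tm)
library(SnowballC)

# Stemming reduces words to their radical: "update", "updated", "updating"
# all map to the stem "updat".
wordStem(c("update", "updated", "updating"), language = "english")
#> [1] "updat" "updat" "updat"

# Applied to the cleaned sample corpus from the previous sketch:
stemmed <- tm_map(clean_sample, stemDocument)

# tm::stemCompletion() can map stems back to the most frequent full form in a
# dictionary corpus, but that is the step whose running time becomes
# prohibitive on a standard laptop.
```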
The distribution of single words can easily be overwhelmed by the most common stop words, like to, the, that or but. It is therefore interesting to contrast the distribution of the most used single words with and without stop words; in this case the top 200 stop words were removed. Stop words must, however, be included in the prediction model, but their removal can support a helpful cluster analysis of terms that tend to appear together, without the noise of stop words that appear everywhere.
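A sketch of how the two distributions could be computed follows; tm's built-in English stop word list is used here as a stand-in for the roughly 200 stop words removed in the report.

```r
library(tm)

# Term frequencies including stop words
tdm_all  <- TermDocumentMatrix(clean_sample)
freq_all <- sort(slam::row_sums(tdm_all), decreasing = TRUE)

# Term frequencies after removing stop words
no_stop <- tm_map(clean_sample, removeWords, stopwords("english"))
freq_ns <- sort(slam::row_sums(TermDocumentMatrix(no_stop)), decreasing = TRUE)

head(freq_all, 10)
head(freq_ns, 10)
```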
A comparison of the frequency distributions, on a log10 scale for clarity, with and without stop words shows how the inclusion of stop words skews the distribution of the most frequent words. Including stop words, the 750 most frequent words account for approximately 50% of all word occurrences in a corpus of a little more than 25,500 terms. Removing stop words doubles that figure: reaching 50% coverage then requires approximately the 1,500 most frequent words. This is in line with the idea that, to gain command of a language, the first thousand words allow for the understanding of approximately 80% of text, as is the case for English - e.g. Davies (2005). Note that in this corpus the results might not be as precise due to the lack of specific cleaning techniques, like stemming.
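These coverage figures can be checked with a small helper like the one below, using the `freq_all` and `freq_ns` vectors from the earlier sketch (names assumed for illustration).

```r
# Number of top-ranked words needed to cover a given share of all occurrences
words_for_coverage <- function(freqs, coverage = 0.5) {
  cum_share <- cumsum(freqs) / sum(freqs)
  which(cum_share >= coverage)[1]
}

words_for_coverage(freq_all, 0.5)  # ~750 words when stop words are included
words_for_coverage(freq_ns, 0.5)   # ~1,500 words when stop words are removed
```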
To create a text prediction model it is very important to identify the most common relationships between words, which can be analyzed using n-grams of different sizes. The plots below show the 10 most common relationships for grams of sizes two, three and four.
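As an illustration, n-gram counts can be built with base R along these lines (an alternative to the RWeka tokenizer commonly used with tm); `count_ngrams` and its inputs are illustrative names only, not the functions used for the plots.

```r
# Count n-grams in a character vector of cleaned text
count_ngrams <- function(docs, n) {
  grams <- unlist(lapply(strsplit(docs, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

head(count_ngrams(c("predicting text is easy",
                    "predicting text is hard"), 2), 5)
```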
The frequency of appearance of a given gram diminishes significantly from 1-grams to 2-grams and again to 3-grams and 4-grams, while the number of possible combinations increases roughly 15-fold from 1-grams to 2-grams and grows more slowly for every gram after that: there are approximately 2.2 times more combinations of 3-grams than 2-grams, and only 1.4 times more combinations of 4-grams than 3-grams. Although the growth is diminishing, the result is still larger and larger data sets.
| Frequency table | 1-gram | 2-grams | 3-grams | 4-grams |
|---|---|---|---|---|
| Number of grams | 25,844 | 405,943 | 926,318 | 1,370,239 |
| Size | 1.7 Mb | 28.3 Mb | 70.8 Mb | 113.3 Mb |
It is noticeable, however, that although there are more possible combinations as n increases, the frequency of the least common combinations plummets after a certain point - approximately after 10,000 unigrams, and beyond 100,000 bigrams, trigrams and four-grams. This characteristic can be used to reduce the size of the final data set, which is in fact a major issue: since the algorithm has to be deployed via a web app, there is a trade-off between keeping the amount of data small enough for the algorithm to be as fast as possible and retaining its predictive power.
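One way to exploit this is to drop n-grams below a minimum count before saving the tables; the threshold and the `cleaned_docs` / `count_ngrams` names below are illustrative, not the final settings.

```r
# Drop the long tail of rare n-grams to shrink the frequency tables
prune_ngrams <- function(ngram_counts, min_count = 2) {
  ngram_counts[ngram_counts >= min_count]
}

bigram_counts <- count_ngrams(cleaned_docs, 2)  # cleaned_docs: cleaned text, one document per element
length(prune_ngrams(bigram_counts))             # far fewer entries than length(bigram_counts)
```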
Making use of the n-gram frequency data, the goal is to predict the next term in a given input text based on the maximum number of grams available. Specifically, split the words of a given sentence and take the last 3-gram available, which is fed into the 4-gram frequency data; this returns the available matches, of which the one with the highest frequency is selected. If there is no 3-gram available, reduce it to a 2-gram and inspect the 3-gram frequencies to make a prediction, and so on. There is no need to keep separate tables for different n-grams, since any n-gram combination can be found in the (n+1)-gram table, so 1-, 2- and 3-gram combinations can be found in a four-gram table.
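A simplified sketch of this backoff lookup is shown below. It assumes the frequency data is stored as a list of data frames, one per n, each with columns `prefix`, `prediction` and `frequency`; this layout is an assumption made for illustration, not necessarily the one used in the final app.

```r
# Simplified backoff: try the longest available prefix first, then shorter ones.
# ngram_tables[[n]] is assumed to be a data frame with columns
# prefix (n-1 words), prediction (last word) and frequency.
predict_next <- function(input, ngram_tables) {
  words <- strsplit(tolower(gsub("[[:punct:]]+", "", input)), "\\s+")[[1]]
  words <- words[words != ""]
  for (n in 4:2) {
    prefix <- paste(tail(words, n - 1), collapse = " ")
    hits <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(hits$prediction[which.max(hits$frequency)])
    }
  }
  # If nothing matches, fall back to the most frequent unigram
  ngram_tables[[1]]$prediction[1]
}
```

With frequency tables like the ones shown in the example below, `predict_next("Predicting text is easy, you just have to", ngram_tables)` would return let.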
For example, entering the input text >> Predicting text is easy, you just have to … << would start by examining the last 3-gram available - just have to - in the 4-gram table to see if there are matches. If there is more than one match, it takes the most frequent one, in this case: let.
| four_grams | frequency |
|---|---|
| “just have to” let | 0.0007665871 |
| “just have to” look | 0.0000582218 |
| “just have to” quit | 0.00003881453 |
In case of no matching pattern, it would back off to the 2-gram - have to - and look it up in the 3-gram table, which, arranged by frequency, would predict in this case: come. Note that the 2-gram sequence can be found at the beginning or end of the 4-gram sequence.
| three_grams | frequency |
|---|---|
| “have to” come to | 0.002455019 |
| “have to” admit that | 0.002416205 |
| i “have to” say | 0.001853394 |
| …. | …. |
In case of finding no matches, it would move to the 1-gram - to - where, again, the sequence could be found at the beginning, centre or end of the original 4-gram patterns.
Additionally, it is important to take into account punctuation that determines new sentences, such as periods, semi-colons or exclamation marks. For example, in the case of >> Predicting text is easy. First … <<, although the last available 3-gram - is easy first - could be looked up, it would be more meaningful to start after the period and look up the 1-gram - first.
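A small helper like the one below could enforce this by keeping only the text after the last sentence-ending punctuation mark before the lookup (function name illustrative).

```r
# Keep only the fragment after the last sentence-ending punctuation mark
last_fragment <- function(input) {
  parts <- strsplit(input, "[.;!?]+")[[1]]
  trimws(parts[length(parts)])
}

last_fragment("Predicting text is easy. First")
#> [1] "First"
```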
Lastly, and ideally, for a given text input, it should be possible to establish word clusters that could be used to obtain predictions suited to the topic of the text as a whole, in combination with the n-gram prediction.
To complete the project the following steps are projected:

- Build a basic Shiny app that initializes itself by preloading a file of n-grams (n = 1 to 4) from the hosted Shiny app storage or, alternatively, external storage such as GitHub, assuming this is feasible (a minimal sketch of such an app follows this list).
- All n-gram file preparation will be done locally and prior to uploading the app to shinyapps.io, as described in the section “Building the n-gram tables”. The app will preload a file that is optimized for use at prediction time. The files will be in .RData or comma-separated values format. Columns include the predictor (the (n-1)-gram), the predicted value (the last word of the n-gram), the frequency, and the computed conditional probability (maximum likelihood estimate).
- Test acceptable file sizes and their impact on Shiny usage, as free Shiny accounts limit computing time. Preliminary testing shows that it is possible to load an 11 MB .RData file (including 4 data frames of all n-grams, n = 1 to 4, sample size = 1 percent) in a Shiny app hosted on shinyapps.io. Loading a 45 MB .RData file (sample size = 5 percent) appears not to be possible and hangs the app.
- Implement the current basic prediction model in the Shiny app.
- Experiment with optimizing the model to handle unseen n-grams, increase model efficiency, and increase overall accuracy.
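The following is a minimal sketch of such a Shiny app; the file name ngrams.RData, the `ngram_tables` object it is assumed to contain, and the `predict_next()` helper are assumptions carried over from the earlier sketches, not the final implementation.

```r
library(shiny)

# Preload the n-gram tables once at startup (file name and contents assumed)
load("ngrams.RData")   # assumed to provide `ngram_tables`

ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("user_text", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$user_text)) == 0) return("")
    predict_next(input$user_text, ngram_tables)
  })
}

shinyApp(ui = ui, server = server)
```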