Author: Russ Robbins
After completing this work I learned better methods. If I have time, I will reapply those methods to this analysis. I expect they will yield better accuracy and better linkage between 1-grams and 2-grams, 2-grams and 3-grams, and so on.
The analysis here focuses on three files: blogs, news, and twitter.
This work is exploratory and supports a larger goal. The overall goal of this project is to build a word prediction algorithm and encapsulate that model in a data product that will allow a user to understand how the algorithm can prospectively help SwiftKey. SwiftKey is a company that provides predictive technologies for easier mobile typing. The analysis shown here is expected to inform the algorithm that I am developing.
This document provides word counts, line counts, and other basic data that describes the information in the blogs, news, and twitter files. Further, it provides basic plots such as histograms to illustrate the features of the data.
To provide context, three example “lines” from the files follow.
Table 1 shows the total number of lines and words in each of the files. It also shows the number of distinct one-, two-, three-, four-, and five-word phrases that were found in the data. In word prediction, these words/phrases are referred to as unigrams, bigrams, trigrams, 4grams, and 5grams. Examples of these types of ngrams follow Table 1.
| | lines | words | unigrams | bigrams | trigrams | 4grams | 5grams |
|---|---|---|---|---|---|---|---|
| blogs | 899,288 | 34,709,465 | 14,577 | 29,865 | 13,581 | 2,053 | 217 |
| news | 1,010,242 | 31,028,342 | 15,308 | 28,842 | 10,087 | 1,245 | 153 |
| twitter | 2,360,148 | 27,048,689 | 10,254 | 153 | 9,570 | 1,527 | 190 |
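As a minimal illustration of what these ngrams are, the base R sketch below extracts unigrams, bigrams, and trigrams from a single made-up line. This is illustrative only and is not the code used to build Table 1; the `sample_line` text and the `make_ngrams` helper are invented for the example.

```r
# Minimal base R sketch of ngram extraction (illustrative only; not the
# code used to build Table 1). The sample line is invented for the example.
sample_line <- "i really love this new phone"

# Tokenize on whitespace after lower-casing
tokens <- unlist(strsplit(tolower(sample_line), "\\s+"))

# Build ngrams by pasting together n consecutive tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

make_ngrams(tokens, 1)  # unigrams: "i" "really" "love" ...
make_ngrams(tokens, 2)  # bigrams:  "i really" "really love" ...
make_ngrams(tokens, 3)  # trigrams: "i really love" ...
```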
In Table 1, note that the number of bigrams is roughly double the number of unigrams, and that the trigrams, 4grams, and 5grams get progressively smaller. This was different from what I expected, since combining more words should multiply the number of possible ngrams. However, this “more possibilities” idea assumes that words are independent of each other, which we know is not the case (e.g., a noun needs a verb, and adjectives/adverbs modify other words). This suggests to me that, at least as represented by this data, the “vocabulary” of phrases that is common across the persons who blogged, wrote news, or tweeted is limited. This is helpful, as the course product is focused on predicting words in phrases.
| | min | max | mean | median | mode |
|---|---|---|---|---|---|
| blogs | 1 | 1,852,821 | 2,381 | 301 | 101 |
| news | 1 | 1,966,998 | 2,027 | 307 | 103 |
| twitter | 1 | 934,525 | 2,638 | 315 | 102 |
Another way to describe the data is to look at the number of instances of unigrams in general. Table 2 shows the minimum, maximum, mean, median, and mode of the number of instances of particular unigrams in the blogs, news, and twitter data. Notice the large range in the number of times a unigram can occur in the set of blogs, news, or twitter lines. For example, certain unigrams, such as the word ‘the’, appear a very large number of times (in the blog data it occurred 1,852,821 times). A description of how the heavy use of certain words can dominate the total number of words used is therefore in order; the percentile table below, and the sketch that follows it, address this.
| | 10% | 30% | 50% | 70% | 90% |
|---|---|---|---|---|---|
| blogs | 2 | 15 | 74 | 426 | 1,131 |
| news | 2 | 19 | 118 | 745 | 1,704 |
| twitter | 3 | 21 | 79 | 325 | 755 |
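As a rough sketch of how the statistics in Table 2 and the percentile table above can be computed, the base R code below summarizes a unigram frequency table. The tiny `unigram_counts` object is an invented stand-in for the real frequency tables, not actual corpus data.

```r
# Sketch of the summary statistics above, assuming `unigram_counts` holds
# the number of instances of each distinct unigram. The tiny vector below
# is an invented stand-in for a real frequency table.
unigram_counts <- table(c("the", "the", "the", "love", "love", "phone"))
counts <- as.numeric(unigram_counts)

# Min, max, mean, and median number of instances per unigram (Table 2)
c(min = min(counts), max = max(counts),
  mean = mean(counts), median = median(counts))

# Mode: the instance count that occurs most often across unigrams
as.integer(names(which.max(table(counts))))

# Percentiles of the instance counts (percentile table above)
quantile(counts, probs = c(0.1, 0.3, 0.5, 0.7, 0.9))
```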
Another way to describe the features of the blogs, news, and twitter files is to study the length of the unigrams used in these online sources. Again, the three sources show similarity: unigrams of three letters are the most frequent, and in the case of twitter, unigrams of four letters are also very frequent. Understanding unigram length may be important for developing and using an algorithm that runs as a user types a word. Information about the distribution of word lengths in general can suggest when a user is likely to stop typing a word, and thus how close the algorithm’s current understanding of the word (e.g., “lov”) is to the actual word (“love”). This in turn can allow the algorithm to begin focusing on other aspects of the prediction problem. Figure 2 presents a histogram of word lengths for blogs.
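A histogram like Figure 2 can be produced with a couple of lines of base R. The sketch below assumes a character vector `unigrams` holding the distinct unigrams from one of the files; the toy vector shown here is invented for illustration.

```r
# Sketch of the word-length histogram in Figure 2, assuming `unigrams`
# is a character vector of the distinct unigrams from one file.
unigrams <- c("the", "and", "to", "love", "phone", "really")  # toy example

lengths <- nchar(unigrams)  # number of letters in each unigram
hist(lengths,
     breaks = seq(0.5, max(lengths) + 0.5, by = 1),
     main   = "Distribution of unigram lengths",
     xlab   = "Unigram length (letters)")
```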
Finally, Figure 3 shows similar information, but instead of using count data to create a histogram, the instance counts are categorized into ranges (e.g., “between 1 and 9 instances”, “over 1 million instances”) and these ranges are used to show the dispersion of the unigrams within them. The x-axis has categories that indicate the number of instances a unigram had. The y-axis indicates, for each of those unigrams, its length. Looking at the first column, we can see that unigrams that occurred between 1 and 9 times ranged from 1 to 4 letters in length. Similarly (on the far right of the figure), words that were used more than 1 million times had a length of either 2 or 3. This corresponds to very common words such as “the”, “and”, and “to”. Places in the columns where the dots are denser indicate that many more words sat at that intersection of volume and length.
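A view like Figure 3 can be sketched by binning the instance counts into ranges with `cut()` and plotting unigram length against the resulting category. The cut points and the tiny `unigram_counts` vector below are illustrative assumptions, not the exact ranges or data behind the figure.

```r
# Sketch of the Figure 3 binning, assuming `unigram_counts` is a named
# vector of instances per distinct unigram (names are the unigrams).
unigram_counts <- c(the = 1852821, and = 950000, love = 4200, xylophone = 3)

# Categorize each unigram by how often it occurred; these cut points are
# illustrative, not the exact ranges used in the figure.
range_bins <- cut(unigram_counts,
                  breaks = c(0, 9, 99, 999, 9999, 1e6, Inf),
                  labels = c("1-9", "10-99", "100-999", "1,000-9,999",
                             "10,000-1M", "over 1M"))

# For each range, plot the lengths of the unigrams that fall into it
plot(range_bins, nchar(names(unigram_counts)),
     xlab = "Number of instances (range)",
     ylab = "Unigram length (letters)")
```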
I plan to focus my efforts on predicting phrases that include the most common words, since these are likely to be the words selected by future prospective users. Further, I plan on using classic approaches to predicting ngrams. These approaches include conditional probability, smoothing, and interpolation/backoff. Conditional probability refers to how likely a word is given the previous word(s). Smoothing addresses the likelihood of words that were not in the data used to compute the conditional probabilities. Interpolation/backoff is the process of using shorter subsets of the current ngram to predict the next word.
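To make these three ideas concrete, the sketch below estimates a bigram conditional probability with add-one (Laplace) smoothing and backs off to a unigram estimate when the context word was never seen. The two-line corpus and the function names are invented for illustration; this is a sketch of the classic approach, not the final prediction code.

```r
# Toy illustration of conditional probability, add-one smoothing, and a
# simple backoff, using an invented two-line corpus (not the real data).
corpus <- c("i love this phone", "i love this case")
line_tokens <- strsplit(tolower(corpus), "\\s+")
tokens <- unlist(line_tokens)

unigram_counts <- table(tokens)
bigram_counts <- table(unlist(lapply(line_tokens, function(t)
  paste(head(t, -1), tail(t, -1)))))
vocab_size <- length(unigram_counts)

# Look up a count, returning 0 for items never seen in the corpus
get_count <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else 0

# P(word | previous) with add-one (Laplace) smoothing:
#   (count(previous word) + 1) / (count(previous) + V)
cond_prob <- function(previous, word) {
  (get_count(bigram_counts, paste(previous, word)) + 1) /
    (get_count(unigram_counts, previous) + vocab_size)
}

# Backoff: if the previous word was never seen at all, fall back to a
# smoothed unigram probability for the candidate word.
predict_prob <- function(previous, word) {
  if (previous %in% names(unigram_counts)) {
    cond_prob(previous, word)
  } else {
    (get_count(unigram_counts, word) + 1) / (length(tokens) + vocab_size)
  }
}

predict_prob("this", "phone")     # seen context: bigram-based estimate
predict_prob("awesome", "phone")  # unseen context: backs off to unigram
```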
If I am successful using classic ngram prediction methods, I may also look into part-of-speech tagging and test whether knowing a word's part of speech (e.g., adjective) is helpful for filtering the set of possible next words. Concurrently, I plan to refamiliarize myself with Shiny and build a skeleton app that addresses the space and computation constraints imposed by the Shinyapps platform. I will also investigate using Python for NLP to supplement R's capabilities, if necessary.
If you could provide your thoughts on this work and my plan I would be very appreciative. Thank you so much for taking the time to review and react to this report.
This appendix is not part of the report but it is included in case anyone wants to look at comparable charts for news and twitter.