The Capstone project will apply data science in the area of natural language processing to predict the next word from an input of one or more words. The course provides a starting set of training data from HC Corpora, using the English-language sources from blogs, news articles and Twitter messages. The purpose of this report is to describe the exploratory analysis done so far and my goals for the eventual app and algorithm.
The training data files can be read as follows:
blog <- file("en_US.blogs.txt", "r", blocking = FALSE); blog_lines <- readLines(blog); close(blog)
news <- file("en_US.news.txt", "r", blocking = FALSE); news_lines <- readLines(news); close(news)
twit <- file("en_US.twitter.txt", "r", blocking = FALSE); twit_lines <- readLines(twit); close(twit)
and we can show some statistics for the three files: the number of lines, the length of the longest line, the total number of characters and the total number of words:
| Source | Lines | Longest line (chars) | Total characters | Words |
|---|---|---|---|---|
| en_US.blogs.txt | 899,288 | 40,835 | 208,361,438 | 38,171,210 |
| en_US.news.txt | 77,258 | 5,760 | 15,639,335 | 2,662,070 |
| en_US.twitter.txt | 2,360,148 | 213 | 162,385,044 | 30,657,973 |
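These statistics can be computed with a small helper along these lines; this is a sketch rather than the exact code used, the `file_stats` name is mine, and the word count is a rough whitespace split:

```r
# Rough summary statistics per file; the word count is a simple whitespace
# split and may differ slightly from tools such as wc -w.
file_stats <- function(lines) {
  c(Lines = length(lines),
    LongestLine = max(nchar(lines)),
    TotalChars = sum(nchar(lines)),
    Words = sum(lengths(strsplit(lines, "\\s+"))))
}

rbind(blogs   = file_stats(blog_lines),
      news    = file_stats(news_lines),
      twitter = file_stats(twit_lines))
```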
Using a random sample of 1% of the lines from the input files, we can build a corpus and tabulate the words by frequency of use. From the full corpus of all words we create a couple of variations. Stopwords are common words with little information content, so we also look at frequencies with the stopwords removed. Stemming is a process that reduces words to their word stems, so that related word forms are counted together. The plot below shows the frequency of the 5,000 most frequent words (on a log/log scale) for the full corpus, for the corpus with stopwords removed and for the stemmed corpus. Removing the stopwords greatly changes which words are most frequent; at lower frequencies the distributions of the three sets are not so different.
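As a sketch of how the sampled corpus and its variants could be built (assuming the tm package with its SnowballC stemmer; the `pick` helper and the exact sampling call are my own illustration, not necessarily the approach used for the plots):

```r
library(tm)

set.seed(1234)
pick <- function(x) sample(x, round(0.01 * length(x)))   # 1% random sample of lines
sample_lines <- c(pick(blog_lines), pick(news_lines), pick(twit_lines))

corpus <- VCorpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

corpus_nostop <- tm_map(corpus, removeWords, stopwords("en"))  # stopwords removed
corpus_stem   <- tm_map(corpus_nostop, stemDocument)           # stemmed variant

# Word frequencies for the full corpus, sorted from most to least frequent
dtm <- TermDocumentMatrix(corpus)
word_freq <- sort(slam::row_sums(dtm), decreasing = TRUE)
```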
Here we show the most frequent words for the full corpus, for the corpus with stopwords removed and for the stemmed corpus:
Word cloud plots, in which text size reflects word frequency, for the 100 most frequent words show some differences. The full corpus is dominated by the stopwords, while the clouds with stopwords removed and after stemming are more similar:
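One way to draw such a cloud is sketched below, assuming the wordcloud package and the `word_freq` table for the full sampled corpus; the other two clouds would use the stopword-removed and stemmed frequencies in the same way:

```r
library(wordcloud)

set.seed(42)
wordcloud(words = names(word_freq), freq = as.numeric(word_freq),
          max.words = 100, random.order = FALSE)
```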
We can also find patterns of more than one word. N-grams are sequences of n words that can help with text prediction: a bi-gram or 2-gram is a two-word string, a tri-gram or 3-gram a three-word string, and so on. The most common bi-grams are shown here by frequency:
and the most common tri-grams:
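A minimal base-R sketch of how these n-gram counts could be produced from the sampled lines; the `make_ngrams` helper is my own illustration, not necessarily the tokenizer used for the figures:

```r
# Count n-grams by splitting each line into lowercase word tokens and pasting
# every run of n consecutive words back together.
make_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "[^a-z']+"), function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

bigrams  <- make_ngrams(sample_lines, 2)
trigrams <- make_ngrams(sample_lines, 3)
head(bigrams, 10)
head(trigrams, 10)
```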
Because some words are so frequent (especially the stopwords), it takes only the 326 most frequent words to cover 50% of word use, while it takes 2,470 words to cover 90%. This is roughly consistent with research around the General Service List, which suggests that about 2,800 word families are needed for 90% text coverage in English.
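The coverage figures come from a cumulative sum over the sorted frequency table; a sketch using the `word_freq` vector from above:

```r
coverage <- cumsum(word_freq) / sum(word_freq)
which(coverage >= 0.5)[1]   # number of distinct words covering 50% of word use
which(coverage >= 0.9)[1]   # number of distinct words covering 90% of word use
```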
My general intent is to use the concept of Markov chains with 4-, 3- and 2-grams from the training set, where the final word of each n-gram is the predicted word - a fixed-order Markov assumption. To expand coverage beyond that, I’d like to find a way to account for power-law scaling, which says that a small number of words occur disproportionately often (e.g. the, to, of) while a very large number of rare words, each of which occurs rarely, together make up a large proportion of the language. The concern with the large number of rare words is that we’ve been warned about constraints on processing time and memory. I’ve done a bit of reading on back-off strategies and smoothing methods. In truth I don’t know how to implement either concept at this time, even though it sounds like these methods could be quite important to avoid overfitting. I’d welcome any comments and suggestions.
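To make the fixed-order idea concrete, here is an illustrative sketch only, not the final algorithm and with no smoothing: look up the last words of the input in the longest-order table and, if nothing matches, back off to shorter n-grams. The `predict_next` helper and the `gram_tables` list are my own names; `gram_tables[[n]]` is assumed to hold the (n+1)-gram counts, e.g. `list(bigrams, trigrams, make_ngrams(sample_lines, 4))` from the sketch above.

```r
predict_next <- function(input, gram_tables) {
  w <- unlist(strsplit(tolower(input), "[^a-z']+"))
  w <- w[w != ""]
  for (n in c(3, 2, 1)) {                 # try a 3-word context, then 2, then 1
    if (length(w) < n) next
    context <- paste(tail(w, n), collapse = " ")
    tab  <- gram_tables[[n]]              # counts of (n+1)-grams for this context length
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))        # the final word of the most frequent match
    }
  }
  NA_character_                           # no match in any table
}

predict_next("thanks for the", list(bigrams, trigrams, make_ngrams(sample_lines, 4)))
```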