For my analysis I will only be working with English. There are three English corpora, derived from blogs, Twitter, and news sources. Below is a bird’s-eye view of their contents: the number of lines, words, and characters, plus words per line (wpl) and characters per word (cpw). Notice that Twitter has the shortest lines and the shortest words, and that News has the longest words.
##                            file   lines    words     chars   wpl  cpw
## 1   final/en_US/en_US.blogs.txt  899288 37334131 208361438 41.52 5.58
## 2    final/en_US/en_US.news.txt   77259  2643969  15683765 34.22 5.93
## 3 final/en_US/en_US.twitter.txt 2360148 30373543 162384825 12.87 5.35
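For reference, here is a rough sketch of how counts like these can be produced in base R. This is not necessarily the exact code used above; the whitespace tokenization is deliberately crude.

```r
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

summarize_file <- function(f) {
  txt   <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(lengths(strsplit(txt, "\\s+")))  # crude whitespace word count
  chars <- sum(nchar(txt))
  data.frame(file = f, lines = length(txt), words = words, chars = chars,
             wpl = round(words / length(txt), 2),  # words per line
             cpw = round(chars / words, 2))        # characters per word
}

do.call(rbind, lapply(files, summarize_file))
```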
After “sentencing” and “tokenizing” the datasets, we can have a look at the most common words and phrases. Taking the Blogs dataset as an example, here are some of the most common (and unsurprising) n-grams for n = 1, 2, 3, 4.
## [1] "the" "and" "to" "of" "a" "i"
## [1] "of the" "in the" "to the" "to be" "on the" "i am"
## [1] "one of the" "as well as" "a lot of" "to be a" "there is no"
## [6] "the end of"
## [1] "at the end of" "when it comes to" "turned out to be"
## [4] "its easy to forget" "to be able to" "going to be a"
The word frequencies, as expected, follow Zipf’s law: a word’s frequency is roughly inversely proportional to its rank, so a log-log plot of frequency against rank is close to a straight line.
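A minimal sketch of that sanity check, assuming a `unigrams` data frame (a hypothetical name) with a `count` column sorted in descending order; the fitted slope should come out near -1.

```r
# Rough check of Zipf's law on the 1-gram table.
zipf_check <- function(unigrams) {
  rank <- seq_len(nrow(unigrams))
  plot(log(rank), log(unigrams$count),
       xlab = "log(rank)", ylab = "log(frequency)",
       main = "Zipf's law (blogs sample)")
  coef(lm(log(unigrams$count) ~ log(rank)))   # slope should be roughly -1
}
```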
I will read in a sample of n random lines of text from the blogs corpus. I chose the blogs corpus because it seemed closest to “natural” human speech. Twitter was full of typos, slang, profanity, and special symbols, and the messages were too short to be predictive. The News corpus was extremely formal and contained too many obscure vocabulary words and proper nouns. Thus, Twitter might predict [hand me my hat, p]***er, News might predict [hand me my hat, p]alestine, and Blogs might predict [hand me my hat, p]lease. Once I finish my algorithm, I plan on “hooking it up” to each corpus separately, just for fun, to see what kind of mood each captures.
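A minimal sketch of the sampling step, with a hypothetical seed and sample size:

```r
set.seed(1234)                       # hypothetical seed, for reproducibility
n <- 50000                           # hypothetical sample size
blogs <- readLines("final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
sample_lines <- sample(blogs, n)     # n random lines from the blogs corpus
```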
Step one is to “sentence” the text, that is, to detect sentence breaks and treat each sentence as a separate object. This is important because the first word of a sentence does not follow from the last word of the preceding sentence, so a line of text cannot be treated as one unbroken stream of words. The biggest hurdle here is interpreting periods: they usually denote sentence breaks, but they are also used for titles, abbreviations, ellipses, and so on. Once that is handled, I strip out all punctuation, convert everything to lowercase, and tokenize on spaces. This gives me a true stream of words, one stream per sentence, which means the result contains more lines than the original sample.
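Below is a simplified sketch of this step. Real sentence-boundary detection (handling titles like “Dr.”, abbreviations, ellipses, and so on) needs more care or a dedicated package; this version just splits on ., !, or ? followed by whitespace.

```r
sentence_tokens <- function(lines) {
  # naive sentence split: a ., !, or ? followed by whitespace ends a sentence
  sentences <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  sentences <- tolower(sentences)
  sentences <- gsub("[^a-z' ]", " ", sentences)   # keep letters and apostrophes
  tokens    <- strsplit(trimws(sentences), "\\s+")
  tokens[lengths(tokens) > 0]                     # one word vector per sentence
}

tokens <- sentence_tokens(sample_lines)
```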
After sentencing and tokenizing, I use R’s data manipulation functions to create n-grams. The simplest case, 1-grams, is essentially a dictionary of words ordered from most to least frequent; 2-grams are word pairs, 3-grams are triples, and so on. I will write the algorithm to take an arbitrary number of levels so that “n” is not hard-coded. That way I can test at what n the n-grams stop being predictive (5? 15? 25?); I imagine this number is low, in the single digits. I will store all these n-grams in .dat files so that the pre-processing is not done in real time. The n-gram tables will be the raw, searchable data my algorithm uses.
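A rough sketch of what such an n-gram builder could look like, assuming the `tokens` list from the previous step (function and file names here are illustrative, not final):

```r
build_ngrams <- function(tokens, n) {
  # collect every run of n consecutive words, sentence by sentence
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  tab <- sort(table(grams), decreasing = TRUE)    # most frequent first
  data.frame(ngram = names(tab), count = as.integer(tab),
             stringsAsFactors = FALSE)
}

ngram_tables <- lapply(1:5, function(n) build_ngrams(tokens, n))
# saveRDS(ngram_tables, "ngrams.dat")  # pre-compute once, load at run time
```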
The gist of my algorithm is this: accept a phrase and a pattern, search for that combination in the n-gram tables in descending order of n, and return the most likely completion. For example, if my phrase is “hand me my hat” and my pattern is “p”, we start with the 5-gram table and search for the exact phrase “hand me my hat” followed by any word starting with “p”. If such a 5-gram exists (hand me my hat please), we are done and the algorithm returns “please”. If not, it checks the 4-gram table for “me my hat p”, then the 3-gram table for “my hat p”, and so on, until finally it simply looks in the dictionary for the most common word starting with “p”. Thus there will always be an answer to any “sensible” input. Note that by “pattern” I mean the first x letters of the word being typed.
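Here is a sketch of that backoff lookup, assuming the `ngram_tables` list from the previous sketch, where element n holds the n-gram table sorted by frequency (names and details are illustrative):

```r
predict_word <- function(phrase, pattern, ngram_tables) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  for (n in seq(length(ngram_tables), 2)) {
    if (length(words) < n - 1) next        # not enough context for this level
    context <- paste(tail(words, n - 1), collapse = " ")
    hits <- grep(paste0("^", context, " ", pattern),
                 ngram_tables[[n]]$ngram, value = TRUE)
    if (length(hits) > 0) {
      # tables are sorted by count, so the first hit is the most frequent
      return(tail(strsplit(hits[1], " ")[[1]], 1))
    }
  }
  # back off all the way to the dictionary: most common word starting with pattern
  hits <- grep(paste0("^", pattern), ngram_tables[[1]]$ngram, value = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

predict_word("hand me my hat", "p", ngram_tables)   # hopefully "please"
```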
To benchmark, I will take a sample of text, cut it at random places, feed the leading fragment into my algorithm, and check whether it correctly predicts the next word. I will benchmark first against the training data set itself, which should get close to 100% matching since it is the same data used to build the n-gram tables; I will examine those predictions closely to get as near to 100% as possible. Then I will benchmark against test data, where I expect the algorithm to correctly guess the next word 20-30% of the time. I will NOT look at the details of the test-data predictions, to avoid overfitting. I will also analyze the pass/fail percentages as a function of “n”, the n-gram level from which the prediction was actually drawn; this will help me determine the optimal size of my n-gram dataset.

Lastly, I will run tests comparing response time, accuracy, and the size of the dataset in RAM. Since my model is based on a sample of training text, I get to decide how big that sample is, and a bigger sample will lead to (a) a bigger model in bytes, (b) a slower response because of searching that model, and (c) a more accurate response. The assignment does not put hard limits on these things, but I believe that if I view them all in a table I will be able to find a point of diminishing returns and decide on an optimal size for the training set.
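A sketch of that benchmarking loop, assuming held-out sentences tokenized the same way as the training data (all names here are hypothetical):

```r
benchmark <- function(test_tokens, ngram_tables, trials = 1000) {
  usable <- test_tokens[lengths(test_tokens) >= 2]   # need at least two words
  hits <- 0
  for (i in seq_len(trials)) {
    sent   <- usable[[sample(length(usable), 1)]]
    cut    <- sample(length(sent) - 1, 1) + 1        # position of the word to predict
    phrase <- paste(sent[1:(cut - 1)], collapse = " ")
    target <- sent[cut]
    # use the first letter of the target word as the "pattern"
    guess  <- predict_word(phrase, substr(target, 1, 1), ngram_tables)
    if (!is.na(guess) && guess == target) hits <- hits + 1
  }
  hits / trials                                      # fraction predicted correctly
}
```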