The data consists of three sets pulled from news articles, blogs, and Twitter posts. The line count of each file is shown below:
## [1] "2360148 en_US.twitter.txt"
## [1] "1010242 en_US.news.txt"
## [1] " 899288 en_US.blogs.txt"
The word counts for the same files:
## [1] "30373583 en_US.twitter.txt"
## [1] "34372530 en_US.news.txt"
## [1] "37334131 en_US.blogs.txt"
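For readers who want to reproduce tallies like these, here is a minimal Python sketch (not the code used to produce the numbers above). It assumes words are delimited by whitespace, and the function name `count_lines_and_words` is hypothetical. Totals may differ slightly from a tool such as `wc`, for instance when a file's last line has no trailing newline.

```python
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (line count, word count), treating whitespace as the word delimiter."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())  # split() with no argument splits on any whitespace
    return n_lines, n_words


# Tiny in-memory example. For a real file, pass an open file handle instead,
# since iterating over a handle yields one line at a time without loading the whole file.
sample = ["the quick brown fox", "jumps over", ""]
print(count_lines_and_words(sample))
```

Streaming over the lines rather than reading each file in full keeps memory use flat, which matters for files of this size.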
Now for the more interesting questions.
It may be helpful to test a number of hypotheses in the exploratory analysis. Conceptually, several intuitions seem likely to hold in the problem of predicting text. Figuring out which of them, if any, actually do hold may go a long way toward helping us choose the correct approach to prediction.
First, one might assume that, at least over short horizons, predictability increases with memory: given four words, it may be easier to predict the fifth than it is to predict the second given only one. In a very simple and restrictive way, we can test this by estimating the mean number of distinct words that follow each sequence of consecutive words as the sequence grows longer. (We break the text by line, since lines seem to delineate entries. Heavy use of sampling throughout also helps improve execution time.)
The series of histograms below plots the distribution of unique predictions given sequences of increasing length. That is, given one word, how many unique words follow it in the sample? Given two words? Given three? The mean count of unique following words is printed below each histogram. The decreasing mean implies that, for a given set of three consecutive words, the set of possible next words is on average smaller than the set of possible third words given two consecutive words. This agrees with intuition:
## [1] 3.028694
## [1] 1.175341
## [1] 1.015769
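The quantity behind these means can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code that produced the figures above: tokens are lowercased and split on whitespace, sequences never cross line boundaries, and the name `mean_unique_followers` and the toy corpus are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def mean_unique_followers(lines: Iterable[str], prefix_len: int) -> float:
    """Mean number of distinct words observed after each prefix of `prefix_len` words."""
    followers: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for line in lines:
        tokens = line.lower().split()
        # Slide over the line; every prefix recorded here has a word following it.
        for i in range(len(tokens) - prefix_len):
            prefix = tuple(tokens[i:i + prefix_len])
            followers[prefix].add(tokens[i + prefix_len])
    if not followers:
        return 0.0
    return sum(len(words) for words in followers.values()) / len(followers)


# Toy corpus standing in for a sample of lines from one of the data files.
toy_lines = [
    "the cat sat on the mat",
    "the cat ran to the dog",
    "the dog sat on the mat",
]
for n in (1, 2, 3):
    print(n, mean_unique_followers(toy_lines, n))
```

Even on this toy corpus the mean falls as the prefix grows, reaching exactly one follower per prefix at length three.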
Generally speaking, we observe that the number of unique words following a given sequence of previous words declines as that sequence grows longer. Of course, this leads to obvious overfitting when allowed to continue arbitrarily. For a clear example of this merely observe this sentence. It is very likely that the phrase “For a” is followed by many different third words in a given dataset, so the above analysis would suggest that any prediction of the third word is weak. However, although the phrase “For a clear example of this merely observe this” may appear only once, so that only one following word is ever observed, it is unlikely that we are perfectly correct in projecting “sentence” as the next word.
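One hedged way to see how quickly this overfitting risk grows is to measure the share of prefixes that occur only once in the sample, since a prefix seen once has exactly one observed follower by construction. The Python sketch below is illustrative only; the function name `singleton_prefix_share` and the toy corpus are hypothetical, and tokens are assumed to be lowercased and whitespace-delimited.

```python
from collections import Counter
from typing import Iterable


def singleton_prefix_share(lines: Iterable[str], prefix_len: int) -> float:
    """Fraction of distinct prefixes (that have a following word) seen exactly once."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.lower().split()
        # Count only prefixes that are followed by at least one word on the same line.
        for i in range(len(tokens) - prefix_len):
            counts[tuple(tokens[i:i + prefix_len])] += 1
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


corpus = [
    "the cat sat on the mat",
    "the cat ran to the dog",
    "the dog sat on the mat",
]
for n in (1, 3):
    print(n, singleton_prefix_share(corpus, n))
```

A share close to one at longer prefix lengths would suggest that the apparent certainty of those predictions mostly reflects sparse data rather than real predictability.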
Continuing the analysis will require, of course, extending it to all three datasets, as well as handling typos, numerals, and, as far as punctuation goes, at least periods. In the context of making a simple Shiny app that users might interact with only once, it is likely useless to attempt to categorize writing styles and change predictions accordingly, and any sort of deep learning on the user’s behavior would necessarily suffer from insufficient information (although both of these analyses would be intriguing to do).