The dataset

The dataset comes from three different sources: news, blogs, and Twitter.

This is a good starting point: the corpus already covers several different origins, and it can later grow with more sources, including the user's own typing.

## data length
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##             20000             20000             20000
## number of words
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##            847632            714315            262941
## number of unique words
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##             42267             41167             22546
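For reference, here is a minimal sketch of how these summaries could be produced, assuming the three files sit in the working directory; reading the first 20,000 lines and the rough tokenisation are simplifications of whatever was actually used for the report.

```r
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

summarise_file <- function(path, n_lines = 20000) {
  # Read a fixed number of lines and split them into lowercase word tokens.
  lines  <- readLines(path, n = n_lines, skipNul = TRUE, warn = FALSE)
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[tokens != ""]
  c(lines        = length(lines),
    words        = length(tokens),
    unique_words = length(unique(tokens)))
}

# One column per file, one row per summary statistic.
sapply(files, summarise_file)
```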

The full dataset is large and demands a lot of computation, so the best algorithm might not be the most accurate one, but the one that makes the best use of limited resources.

The incremental benefit

Some words are far more common than others: connectives, pronouns, and adjectives occur much more frequently than obscure nouns or compound words.

But every word costs something to analyse, so choosing a good cut-off is a key point in this business.

My approach is to take the derivative of the previous chart and cut every word whose value falls below 10. These are the words whose contribution is worth less than the cost of processing them.

This cut-off might become a parameter of the model, making it possible to weigh the cost of digging deeper into the dataset against staying with the most frequent words.
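A minimal sketch of this cut-off, assuming a character vector `tokens` holding all tokens from the sampled corpus (for instance built as in the sketch above). Reading the chart's derivative as each word's marginal contribution, cutting values below 10 amounts to keeping words that occur at least 10 times:

```r
# Word frequencies in decreasing order.
freq <- sort(table(tokens), decreasing = TRUE)

# Cumulative count of tokens covered as words are added in frequency order;
# its discrete derivative is simply each word's own frequency.
cumulative <- cumsum(as.numeric(freq))
derivative <- diff(c(0, cumulative))

# Keep only words whose marginal contribution is at least 10 occurrences.
vocab <- names(freq)[derivative >= 10]
length(vocab)
```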

The pair frequency

5656 words will be used to train the model, which seems quite affordable.

Here are the top 20 tokens:

##    token count
## 1    the 88132
## 2     to 47978
## 3    and 45893
## 4      a 42877
## 5     of 37689
## 6      i 31483
## 7     in 29727
## 8   that 20160
## 9     it 19664
## 10     s 18846
## 11    is 18518
## 12   for 18203
## 13   you 14548
## 14    on 13770
## 15  with 12957
## 16   was 11742
## 17    at  9750
## 18  this  9517
## 19    he  9298
## 20    be  9190
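As a sketch of where the pair counts come from, adjacent word pairs could be tallied over the retained vocabulary along these lines (the `tokens` and `vocab` objects are the assumed ones from the sketches above; a fuller implementation would split per line or sentence first so that pairs do not straddle document boundaries):

```r
# Shift the token vector by one position to line up each word with its successor.
first  <- tokens[-length(tokens)]
second <- tokens[-1]

# Keep only pairs where both words survived the frequency cut-off.
keep <- first %in% vocab & second %in% vocab

# Count each adjacent pair and show the most frequent ones.
pair_freq <- sort(table(paste(first[keep], second[keep])), decreasing = TRUE)
head(pair_freq, 20)
```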