The dataset comes from three different sources: news, blogs and Twitter.
This is good: the text already comes from several different origins, and more sources can be added later, including the user's own typing.
##                        en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## data length (lines)              20000          20000             20000
## number of words                 847632         714315            262941
## number of unique words           42267          41167             22546
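These figures could be reproduced along the following lines. The file names appear in the table above; the 20,000-line sample per source and the tokenisation rules (lower-casing and dropping everything except letters) are assumptions made for this sketch:

```r
# Minimal sketch, not the exact code behind the table above.
# Assumptions: a 20,000-line sample per file and a letters-only tokeniser.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

summary_stats <- sapply(files, function(f) {
  lines  <- readLines(f, n = 20000, encoding = "UTF-8", skipNul = TRUE)
  tokens <- unlist(strsplit(gsub("[^a-z]", " ", tolower(lines)), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  c("data length"            = length(lines),
    "number of words"        = length(tokens),
    "number of unique words" = length(unique(tokens)))
})
summary_stats   # one column per source, matching the table above
```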
The model might get huge in size and require a lot of computation, so the best algorithm might not be the most accurate one, but the one that needs the fewest resources.
Some words are more common than others: connectives, pronouns and adjectives occur far more frequently than obscure nouns or compound words.
But every word costs something to analyse, so making a good cut-off is a key point in this business.
My approach is to take the derivative of the previous chart and cut all values below 10. Those are the words whose contribution is worth less than the cost of processing them.
This threshold might become a parameter of the model in the future, making it possible to weigh the cost of going deeper into the dataset against staying with only the most frequent words.
A total of 5656 words will be used to train the model, which seems pretty affordable.
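As a rough sketch of that idea (one possible reading of the rule, not necessarily the exact one used here): sort the word counts, treat the successive differences as the derivative of the frequency curve, and keep everything up to the last rank where the curve still drops by at least 10.

```r
# Sketch only: interprets "derivative of the previous chart" as the
# successive differences of the sorted word-frequency curve, and the
# cut-off "< 10" as dropping the flat tail where that difference is < 10.
files  <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
lines  <- unlist(lapply(files, readLines, n = 20000, encoding = "UTF-8", skipNul = TRUE))
tokens <- unlist(strsplit(gsub("[^a-z]", " ", tolower(lines)), "\\s+"))
tokens <- tokens[nzchar(tokens)]

freq   <- sort(table(tokens), decreasing = TRUE)  # the word-frequency curve
slope  <- abs(diff(as.integer(freq)))             # its discrete derivative
cutoff <- max(which(slope >= 10))                 # last rank with a drop of at least 10
vocab  <- names(freq)[seq_len(cutoff)]
length(vocab)   # the report keeps 5656 words for its sample
```

The threshold (here 10) is exactly the knob the previous paragraph proposes to expose as a model parameter.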
Here are the top 20 tokens:
##    token count
## 1    the 88132
## 2     to 47978
## 3    and 45893
## 4      a 42877
## 5     of 37689
## 6      i 31483
## 7     in 29727
## 8   that 20160
## 9     it 19664
## 10     s 18846
## 11    is 18518
## 12   for 18203
## 13   you 14548
## 14    on 13770
## 15  with 12957
## 16   was 11742
## 17    at  9750
## 18  this  9517
## 19    he  9298
## 20    be  9190
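For reference, a table like this can be read straight off the frequency counts from the sketch above (again an assumption about how the original output was produced):

```r
# Top 20 tokens from the sorted frequency table `freq` built above.
head(data.frame(token = names(freq), count = as.integer(freq)), 20)
```

The bare token `s` at rank 10 is most likely a by-product of stripping punctuation before tokenising, so that contractions such as "it's" contribute both `it` and `s`.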