## [1] 200.4242
## [1] 159.3641
## [1] 196.2775
## [1] 899288
## Length Class Mode
## 899288 character character
## [1] 2360148
## Length Class Mode
## 2360148 character character
## [1] 77259
## Length Class Mode
## 77259 character character
Since the dataset is very large, I cannot use all of it to train my model. I am taking only 10% of each file as training data.
## Length Class Mode
## 89929 character character
## Length Class Mode
## 236015 character character
## Length Class Mode
## 7726 character character
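The 10% sampling step can be sketched roughly as follows. This is a minimal illustration, not the exact code used for this report: the helper `sample_lines`, the seed value, and the toy input are all assumptions made for the example. Rounding to the nearest whole line is at least consistent with the sample sizes shown above.

```r
# Hypothetical helper (not from the original report): keep a random
# fraction of the lines stored in a character vector.
sample_lines <- function(lines, fraction = 0.10, seed = 1234) {
  set.seed(seed)                                # assumed seed, for reproducibility
  n_keep <- round(length(lines) * fraction)     # number of lines to keep
  lines[sample.int(length(lines), size = n_keep)]  # sample without replacement
}

# Toy stand-in for one of the text files (not the real data).
toy_lines  <- paste("line", 1:1000)
toy_sample <- sample_lines(toy_lines)
summary(toy_sample)
```

Sampling without replacement keeps every selected line unique, and fixing the seed makes the same subset come back on every run, which helps when comparing models trained on the sample.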
Note: I am not removing stop words, because stop words help connect the other words in a sentence, and keeping them is an important factor in the performance of our model.
As we can see, all of the top 15 most frequent words are stop words, and the word “the” is the most frequent word in all 3 datasets.
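A top-15 frequency count like the one described here could be produced along these lines in base R. This is only a sketch under simple assumptions: the helper `top_words`, the lower-casing, and the letters-and-apostrophes tokenization rule are choices made for the example, and the report itself may have used a text-mining package instead. Stop words are deliberately kept.

```r
# Hypothetical helper (not from the original report): return the n most
# frequent words in a character vector of lines, keeping stop words.
top_words <- function(lines, n = 15) {
  # Lower-case, then split on anything that is not a letter or apostrophe.
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nzchar(words)]                  # drop empty tokens
  freq  <- sort(table(words), decreasing = TRUE) # counts, largest first
  head(freq, n)
}

# Toy stand-in for a sampled dataset (not the real data).
toy_text <- c("The cat sat on the mat",
              "the dog and the cat",
              "A dog is on the mat")
top_words(toy_text, n = 3)
```

On the toy input the most frequent word is "the", with 5 occurrences, mirroring on a small scale the pattern seen in the three datasets.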