File size in MB

Blog

## [1] 200.4242

Twitter

## [1] 159.3641

News

## [1] 196.2775
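
The report does not show how these sizes were computed. A minimal sketch of one way to get a file size in MB with base R follows. The helper name `file_size_mb` is hypothetical, and treating 1 MB as 1024^2 bytes is an assumption.

```r
# Hypothetical helper: size of a file in megabytes.
# Assumes 1 MB = 1024^2 bytes.
file_size_mb <- function(path) {
  file.info(path)$size / 1024^2
}

# Demonstration on a throwaway file of exactly 2 * 1024^2 bytes,
# so the example runs without the original corpus files.
tmp <- tempfile(fileext = ".txt")
writeBin(raw(2 * 1024^2), tmp)
size_mb <- file_size_mb(tmp)
print(size_mb)
```

The same call could be applied to each of the three corpus files in turn by passing its path.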

Length and summary

Blog

## [1] 899288
##    Length     Class      Mode 
##    899288 character character

Twitter

## [1] 2360148
##    Length     Class      Mode 
##   2360148 character character

News

## [1] 77259
##    Length     Class      Mode 
##     77259 character character
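
The line counts and summaries above have the shape of `length()` and `summary()` applied to a character vector. A minimal sketch, assuming each file is read with one element per line, is given here. The helper name `read_corpus` is hypothetical. Opening the connection in binary mode is a precaution I have added against reads that stop early on embedded control characters, not something the report states.

```r
# Hypothetical helper: read a text file into a character vector,
# one element per line.
read_corpus <- function(path) {
  con <- file(path, open = "rb")  # binary mode (an added precaution)
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file instead of the real corpus.
tmp_corpus <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp_corpus)

demo_lines <- read_corpus(tmp_corpus)
print(length(demo_lines))   # number of lines read
print(summary(demo_lines))  # Length / Class / Mode, as in the output above
```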

Because the dataset is very large, I cannot use all of it to train my model. I am taking only 10% of each file as a training sample.
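
A minimal sketch of drawing a 10% random sample of lines is shown here. The helper name `sample_lines`, the seed value, and the use of `round()` for the sample size are all assumptions, although rounding is consistent with the sample lengths reported below.

```r
# Hypothetical helper: keep a random fraction of the lines.
sample_lines <- function(lines, fraction = 0.1) {
  n <- round(length(lines) * fraction)
  # Index with sample.int so the result is always a subset of `lines`.
  lines[sample.int(length(lines), n)]
}

set.seed(1234)  # arbitrary seed, only for reproducibility

# Demonstration on synthetic lines instead of the real corpus.
demo_corpus <- paste("line", 1:1000)
demo_sample <- sample_lines(demo_corpus, 0.1)
print(length(demo_sample))
```

Sampling without replacement keeps every selected line unique to its position in the original file, so the sample is a true subset of the corpus.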

Summary of sample data

Blog

##    Length     Class      Mode 
##     89929 character character

Twitter

##    Length     Class      Mode 
##    236015 character character

News

##    Length     Class      Mode 
##      7726 character character

15 Most frequent words in each dataset

Note: I am not removing stop words, because stop words connect the other words in a sentence, and keeping them is an important factor in the performance of the model.

Blog

Twitter

News

As we can see, the 15 most frequent words are all stop words, and “the” is the most frequent word in all three datasets.
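
The counting code is not shown in this report. A minimal base R sketch of how the most frequent words could be tabulated follows. The helper name `top_words` is hypothetical, and the tokenization rule (lowercase, then split on anything other than letters and apostrophes) is an assumption.

```r
# Hypothetical helper: the n most frequent words in a character vector.
# Stop words are deliberately kept, as in the analysis above.
top_words <- function(lines, n = 15) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[tokens != ""]  # drop empty strings left by the split
  head(sort(table(tokens), decreasing = TRUE), n)
}

# Demonstration on two toy sentences instead of the sampled corpus.
demo_text <- c("The cat sat on the mat", "the dog and the cat")
demo_top <- top_words(demo_text, 3)
print(demo_top)
```

On the toy input, "the" comes first with a count of 4, which mirrors the pattern seen in the real datasets.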