Stopwords were removed during tokenization. A barplot of the 50 most frequent remaining words is shown for each dataset.
Number of words (Twitter dataset)
## [1] 17503728
Number of lines (Twitter dataset)
## [1] 2360148
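The report does not reproduce the code behind these counts and figures; they could be produced along the following lines. This is a minimal sketch assuming the tidytext tokenizer and the file name en_US.twitter.txt, neither of which is confirmed by the report:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Assumed input file; one document per line
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
length(lines)                           # number of lines

words <- tibble(text = lines) %>%
  unnest_tokens(word, text)             # lowercase word tokenizer
nrow(words)                             # number of words

# Drop English stopwords, then plot the 50 most frequent words
words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 50) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```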
As expected, ‘happy birthday’ is among the top bigrams on Twitter; the other top bigrams are also in line with expectations.
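A bigram frequency table of the kind summarized here can be built the same way; this is again a sketch under the same assumed tokenizer and file name:

```r
library(dplyr)
library(tidytext)

lines <- readLines("en_US.twitter.txt", skipNul = TRUE)

bigram_freq <- tibble(text = lines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%            # lines too short to form a bigram yield NA
  count(bigram, sort = TRUE)

head(bigram_freq, 10)                   # 'happy birthday' shows up near the top
```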
Number of words (blogs dataset)
## [1] 20358576
Number of lines (blogs dataset)
## [1] 899288
Number of words (news dataset)
## [1] 1602667
Number of lines (news dataset)
## [1] 77259
The bigrams in the news dataset are also in line with expectations.
The datasets will be divided into a training set (70%) and a test set (30%). The next word will be predicted under the Markov assumption, conditioning on the previous one or two words. Backoff can be used for smoothing. Instead of multiplying raw probabilities, their logarithms will be summed to prevent numerical underflow. Perplexity can then be used to evaluate the effectiveness of the model.
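As an illustration of this plan, here is a minimal sketch assuming stupid backoff over trigram, bigram, and unigram counts; the backoff scheme, file name, and helper names are assumptions rather than the report's actual design:

```r
library(dplyr)
library(tidyr)
library(tidytext)

set.seed(1)

# 70/30 train/test split at the line level, as described above
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
idx   <- sample(length(lines), floor(0.7 * length(lines)))
train <- tibble(text = lines[idx])

# n-gram frequency tables from the training set
count_ngrams <- function(df, n, cols) {
  df %>%
    unnest_tokens(ng, text, token = "ngrams", n = n) %>%
    filter(!is.na(ng)) %>%
    separate(ng, cols, sep = " ") %>%
    count(across(all_of(cols)))
}
uni <- count_ngrams(train, 1, "w1")
bi  <- count_ngrams(train, 2, c("w1", "w2"))
tri <- count_ngrams(train, 3, c("w1", "w2", "w3"))

# Backoff: use the trigram table when the two-word context was seen,
# otherwise fall back to bigrams, then to the overall most frequent word
predict_next <- function(prev2, prev1) {
  hit <- tri %>% filter(w1 == prev2, w2 == prev1) %>% slice_max(n, n = 1)
  if (nrow(hit) > 0) return(hit$w3[1])
  hit <- bi %>% filter(w1 == prev1) %>% slice_max(n, n = 1)
  if (nrow(hit) > 0) return(hit$w2[1])
  uni$w1[which.max(uni$n)]
}
predict_next("of", "the")    # most frequent word seen after "of the"

# Work in log space: log P(w1..wN) = sum_i log P(w_i | context), which
# avoids underflow from multiplying many small probabilities.
# Perplexity of a vector of per-word conditional probabilities p:
perplexity <- function(p) exp(-mean(log(p)))
```

In practice the n-gram tables would be pruned or built from a sample of the corpus; this sketch ignores efficiency and unseen-word handling.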