Stopwords were removed during tokenization. A barplot of the 50 most frequent remaining words is shown for each dataset.
Number of words (Twitter dataset)
## [1] 17503728
Number of lines (Twitter dataset)
## [1] 2360148
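The report does not reproduce the code behind these counts and figures; they could be produced along the following lines. This is a minimal sketch assuming the tidytext tokenizer and the file name en_US.twitter.txt, neither of which is confirmed by the report:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Assumed input file; one document per line
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
length(lines)                           # number of lines

words <- tibble(text = lines) %>%
  unnest_tokens(word, text)             # lowercase word tokenizer
nrow(words)                             # number of words

# Drop English stopwords, then plot the 50 most frequent words
words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 50) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```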
As expected, ‘happy birthday’ is among the top bigrams on Twitter; the other top bigrams are also in line with expectations.
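A bigram frequency table of the kind summarized here can be built the same way; this is again a sketch under the same assumed tokenizer and file name:

```r
library(dplyr)
library(tidytext)

lines <- readLines("en_US.twitter.txt", skipNul = TRUE)

bigram_freq <- tibble(text = lines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%            # lines too short to form a bigram yield NA
  count(bigram, sort = TRUE)

head(bigram_freq, 10)                   # 'happy birthday' shows up near the top
```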
Number of words (blogs dataset)
## [1] 20358576
Number of lines (blogs dataset)
## [1] 899288
Number of words (news dataset)
## [1] 1602667
Number of lines (news dataset)
## [1] 77259
The bigrams in the news dataset are also in line with expectations.
The datasets will be divided into a training set (70%) and a test set (30%). The next word will be predicted under the Markov assumption, conditioning on the previous one or two words. Backoff can be used for smoothing. Instead of multiplying raw probabilities, their logarithms will be summed to prevent numerical underflow. Perplexity can then be used to evaluate the effectiveness of the model.
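As an illustration of this plan, here is a minimal sketch assuming stupid backoff over trigram, bigram, and unigram counts; the backoff scheme, file name, and helper names are assumptions rather than the report's actual design:

```r
library(dplyr)
library(tidyr)
library(tidytext)

set.seed(1)

# 70/30 train/test split at the line level, as described above
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
idx   <- sample(length(lines), floor(0.7 * length(lines)))
train <- tibble(text = lines[idx])

# n-gram frequency tables from the training set
count_ngrams <- function(df, n, cols) {
  df %>%
    unnest_tokens(ng, text, token = "ngrams", n = n) %>%
    filter(!is.na(ng)) %>%
    separate(ng, cols, sep = " ") %>%
    count(across(all_of(cols)))
}
uni <- count_ngrams(train, 1, "w1")
bi  <- count_ngrams(train, 2, c("w1", "w2"))
tri <- count_ngrams(train, 3, c("w1", "w2", "w3"))

# Backoff: use the trigram table when the two-word context was seen,
# otherwise fall back to bigrams, then to the overall most frequent word
predict_next <- function(prev2, prev1) {
  hit <- tri %>% filter(w1 == prev2, w2 == prev1) %>% slice_max(n, n = 1)
  if (nrow(hit) > 0) return(hit$w3[1])
  hit <- bi %>% filter(w1 == prev1) %>% slice_max(n, n = 1)
  if (nrow(hit) > 0) return(hit$w2[1])
  uni$w1[which.max(uni$n)]
}
predict_next("of", "the")    # most frequent word seen after "of the"

# Work in log space: log P(w1..wN) = sum_i log P(w_i | context), which
# avoids underflow from multiplying many small probabilities.
# Perplexity of a vector of per-word conditional probabilities p:
perplexity <- function(p) exp(-mean(log(p)))
```

In practice the n-gram tables would be pruned or built from a sample of the corpus; this sketch ignores efficiency and unseen-word handling.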