Introduction

The goal of this project is to create an n-gram model. An n-gram model is a type of probabilistic language model that predicts the next word or words based on a given input. Before using n-grams to find correlations and relationships between words, we conduct a simple exploratory analysis of the input text.

This report shows word frequencies. We are often interested in what a set of texts talks about, for example which words a newspaper uses more often than others. The input data is provided by SwiftKey; we use three English files: news, blogs, and Twitter.

Downloading files

Link to all archives is here. The data consists of sentences from three different sources: news, Twitter and blogs.

Basic summary

Line and word counts in each file:

  • Blogs: 899288 lines, 36646432 words
  • News: 77259 lines, 2607215 words
  • Twitter: 2360148 lines, 29485282 words
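The report does not show how these counts were produced. As a minimal sketch, assuming each file has been read into a character vector with readLines() and that a "word" is any whitespace-separated token, counts of this kind could be computed in base R as below. The helper name count_lines_words and the demo vector are hypothetical, and a different word definition would give different totals.

```r
# Hypothetical helper (not from the report): count lines and
# whitespace-separated words in a character vector of text lines.
count_lines_words <- function(lines) {
  # Split every line on runs of whitespace, then drop empty strings
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  c(lines = length(lines), words = length(tokens))
}

# Tiny in-memory stand-in for the result of readLines() on a real file
demo <- c("the quick brown fox", "jumps over", "the lazy dog")
counts <- count_lines_words(demo)
print(counts)
```

On the real files the same function would be applied to each vector in turn, at the cost of holding all tokens in memory at once.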

Exploratory analysis

We set a seed so that the random sampling is reproducible.

set.seed(3999)

Since the datasets are too large to process in full, we draw a random sample from each of them.

Using NGramTokenizer, we split the sample into n-gram tokens that can be used in a prediction model. The most frequent bigrams are:
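The sampling step itself is not shown in the report. One possible sketch, assuming simple random sampling of lines without replacement, is given below. The helper sample_lines, its fraction parameter, and the 10% value are illustrative assumptions, not the settings used for the results in this report.

```r
set.seed(3999)  # same seed as the report, so the draw is reproducible

# Hypothetical helper: keep a random fraction of the lines.
# `fraction` is an assumed parameter, not a value taken from the report.
sample_lines <- function(lines, fraction = 0.01) {
  n <- min(length(lines), max(1, floor(length(lines) * fraction)))
  lines[sample.int(length(lines), n)]
}

# Stand-in for a large file already read with readLines()
demo <- paste("line", 1:1000)
demo_sample <- sample_lines(demo, fraction = 0.1)
length(demo_sample)
```

Sampling line indices rather than the lines themselves keeps the helper cheap even for vectors with millions of elements.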

##          Var1 Freq
## 56866  of the  792
## 40926  in the  743
## 86024  to the  357
## 57881  on the  288
## 30949 for the  262
## 7585  and the  250
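The table above comes from NGramTokenizer (provided by the RWeka package, which needs Java). To illustrate the idea without that dependency, here is a base R stand-in that builds n-grams with a sliding window and tabulates them. The helper make_ngrams and its simple cleaning rule (lowercase, keep letters and apostrophes) are assumptions and will not reproduce RWeka's tokenization exactly.

```r
# Hypothetical base R stand-in for an n-gram tokenizer (not RWeka's implementation).
make_ngrams <- function(lines, n = 2) {
  grams <- lapply(lines, function(line) {
    # Lowercase, replace anything that is not a letter, apostrophe or space
    cleaned <- gsub("[^a-z' ]+", " ", tolower(line))
    words <- strsplit(trimws(cleaned), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Paste each window of n consecutive words into one token
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(grams)
}

demo <- c("The cat sat on the mat.", "The dog sat on the rug.")
bigrams <- make_ngrams(demo, n = 2)

# Frequency table sorted from most to least frequent
freq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq), ]
head(freq)
```

Calling make_ngrams(demo, n = 3) gives trigrams in the same way. N-grams are built per line, so they never span two lines.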

Now let’s make some visualizations.

A word cloud shows the most popular words in our combined sample.

The bar plots below show the most frequent two-word and three-word combinations.

Bigrams

Trigrams
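The plotting code is not included in the report. As a rough sketch, a bar plot of this kind could be drawn with base graphics as follows. The data frame top simply reuses the six bigram counts printed earlier; its name and column names, and the plot title and margins, are assumptions.

```r
# Top bigram counts copied from the frequency table printed in this report;
# the data frame and column names are hypothetical.
top <- data.frame(
  ngram = c("of the", "in the", "to the", "on the", "for the", "and the"),
  Freq  = c(792, 743, 357, 288, 262, 250),
  stringsAsFactors = FALSE
)

# Widen the left margin so the two-word labels fit
op <- par(mar = c(4, 6, 2, 1))
# Horizontal bars; rev() places the most frequent bigram at the top
mids <- barplot(rev(top$Freq), names.arg = rev(top$ngram),
                horiz = TRUE, las = 1,
                xlab = "Frequency",
                main = "Most frequent bigrams in the sample")
par(op)
```

The same call works for trigrams once a matching frequency table is available.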

Conclusion

So far, I’ve downloaded the data and split it into words and n-grams. These objects can be used to build prediction models.
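To indicate how such frequency tables might feed a prediction model, here is a minimal sketch that suggests the next word by looking up the most frequent bigrams starting with a given word. The function predict_next, the table layout, and the demo counts are made up for illustration. A real model would also need smoothing or back-off for unseen words.

```r
# Hypothetical helper: suggest the k most frequent next words after `word`,
# given a data frame with columns `ngram` (two words) and `Freq`.
predict_next <- function(freq, word, k = 3) {
  parts  <- strsplit(freq$ngram, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- first == tolower(word)
  if (!any(hits)) return(character(0))  # unseen word: no suggestion
  ranked <- order(-freq$Freq[hits])
  head(second[hits][ranked], k)
}

# Made-up demo counts, not results from the report
demo_freq <- data.frame(
  ngram = c("of the", "of a", "in the", "of course"),
  Freq  = c(5, 2, 4, 3),
  stringsAsFactors = FALSE
)
suggestions <- predict_next(demo_freq, "of")
print(suggestions)
```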