The goal of this project is to build an n-gram model. An n-gram model is a type of probabilistic language model that predicts the next word or words based on the given input. Before using n-grams to examine the relationships between words, we conduct a simple exploratory analysis of the input text.
This report shows word frequencies. Often we are interested in what is mentioned most in a set of texts, for example which words a newspaper uses more than others. The input data is provided by SwiftKey; we use its three English files: news, blogs, and Twitter.
Link to all archives is here. The data consists of sentences from three different sources: news, Twitter, and blogs.

### Basic summary
Line and word counts for each file:
## [1] 899288
## [1] 36646432
## [1] 77259
## [1] 2607215
## [1] 2360148
## [1] 29485282
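Counts like the ones above could be produced along the following lines. This is a minimal sketch in base R, not the code used for this report: the helper `count_lines_words` is hypothetical, it runs on a small in-memory vector instead of the SwiftKey files, and it assumes a word is anything separated by whitespace (a different tokenizer would give slightly different totals).

```r
# Hypothetical helper: count lines and whitespace-separated words
# in a character vector that holds one line of text per element.
count_lines_words <- function(txt) {
  # Trim each line, then split it on runs of whitespace
  tokens <- strsplit(trimws(txt), "\\s+")
  # Count only non-empty tokens, so blank lines contribute zero words
  n_words <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  c(lines = length(txt), words = n_words)
}

# Toy stand-in for one input file; with the real data the vector
# would come from readLines() on the file path.
toy <- c("the quick brown fox", "jumps over", "", "the lazy dog")
counts <- count_lines_words(toy)
print(counts)
```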
Set a seed so that the random sampling is reproducible.
set.seed(3999)
Since the datasets are too large to process in full, it is wise to draw a random sample from each.
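A minimal sketch of the sampling step, in base R. The helper `sample_lines` and the 1% fraction are assumptions made for illustration; the report does not state the sample size that was actually used, and the toy vector stands in for the lines of one input file.

```r
# Hypothetical helper: keep a random fraction of the lines,
# drawn without replacement.
sample_lines <- function(txt, fraction) {
  n_keep <- max(1, round(length(txt) * fraction))
  txt[sample.int(length(txt), n_keep)]
}

set.seed(3999)  # same seed as in the report, so the draw is repeatable

# Toy stand-in for a large file: 1000 numbered lines.
toy <- paste("line", seq_len(1000))

# Assumed fraction: 1% of the lines.
toy_sample <- sample_lines(toy, fraction = 0.01)
print(length(toy_sample))
```

Sampling each file separately and then combining the samples keeps all three sources represented in the combined sample.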
Using NGramTokenizer, we can split our data into n-gram tokens that we can use in a prediction model. The most frequent two-word tokens in the sample are:
## Var1 Freq
## 56866 of the 792
## 40926 in the 743
## 86024 to the 357
## 57881 on the 288
## 30949 for the 262
## 7585 and the 250
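To make the tokenizing step concrete, here is a minimal base R sketch that builds bigrams and tabulates them. It stands in for NGramTokenizer rather than reproducing it: the helper `make_ngrams` is hypothetical, and the cleaning rules (lowercasing, dropping everything except letters, apostrophes and spaces) are assumptions, so counts on real data would not necessarily match the table above.

```r
# Hypothetical helper: build word n-grams from each line of text.
# n-grams never span two lines.
make_ngrams <- function(txt, n = 2) {
  # Lowercase, then replace anything that is not a letter,
  # apostrophe or space with a space
  clean <- gsub("[^a-z' ]", " ", tolower(txt))
  words_per_line <- strsplit(trimws(clean), " +")
  grams <- lapply(words_per_line, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(grams)
}

toy <- c("The cat sat on the mat.", "The dog sat on the rug.")
bigrams <- make_ngrams(toy, n = 2)

# Tabulate and sort by count, using the Var1/Freq column names
bigram_freq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
names(bigram_freq) <- c("Var1", "Freq")
bigram_freq <- bigram_freq[order(-bigram_freq$Freq), ]
print(head(bigram_freq))
```

Calling `make_ngrams(toy, n = 3)` gives three-word combinations in the same way.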
Now let’s make some visualizations.
A word cloud shows the most popular words in our combined sample.
And some bar plots for two- and three-word combinations.
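A bar plot of the top n-grams could be drawn roughly as follows. This is a sketch using base R graphics, not the plotting code behind this report; the helper `plot_top_ngrams` is hypothetical. The data frame holds the six bigram counts listed earlier.

```r
# Hypothetical helper: bar plot of the k most frequent n-grams.
# `freq` is a data frame with columns Var1 (n-gram) and Freq (count).
plot_top_ngrams <- function(freq, k = 10, main = "Most frequent n-grams") {
  top <- head(freq[order(-freq$Freq), ], k)
  # las = 2 turns the axis labels so longer n-grams stay readable
  barplot(top$Freq, names.arg = top$Var1, las = 2,
          main = main, ylab = "Frequency")
}

# The six most frequent bigrams from the frequency table
top_bigrams <- data.frame(
  Var1 = c("of the", "in the", "to the", "on the", "for the", "and the"),
  Freq = c(792, 743, 357, 288, 262, 250),
  stringsAsFactors = FALSE
)
```

Calling `plot_top_ngrams(top_bigrams, main = "Top bigrams")` draws the chart on the active graphics device; the same helper works for a three-word frequency table.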
So far, I’ve downloaded the data and split it into words and n-grams. These objects can be used to build prediction models.
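As a rough illustration of how a bigram frequency table can feed a prediction model, the sketch below ranks candidate next words by their relative frequency after a given word. The helper `predict_next` is hypothetical, the counts are invented for the example, and a real model would also need smoothing or backoff for unseen words, which is left out here.

```r
# Hypothetical helper: rank candidate next words after `prev_word`
# using relative frequencies from a bigram table (columns Var1, Freq).
predict_next <- function(freq, prev_word, k = 3) {
  parts <- strsplit(freq$Var1, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hit <- first == prev_word
  if (!any(hit)) {
    # No bigram starts with prev_word: return an empty result
    return(data.frame(word = character(0), prob = numeric(0)))
  }
  # P(next | prev) = count(prev next) / count(prev followed by anything)
  totals <- tapply(freq$Freq[hit], second[hit], sum)
  out <- data.frame(word = names(totals),
                    prob = as.numeric(totals) / sum(totals),
                    stringsAsFactors = FALSE)
  head(out[order(-out$prob), ], k)
}

# Toy counts, invented purely for illustration
toy_freq <- data.frame(
  Var1 = c("of the", "of a", "of course", "in the"),
  Freq = c(6, 3, 1, 5),
  stringsAsFactors = FALSE
)
pred <- predict_next(toy_freq, "of")
print(pred)
```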