"Binod Jung Bogati"
"6/20/2018"
This is a milestone report for the Data Science Capstone project, which analyzes the HC Corpora dataset.
The main goal of the project is to create a data product that performs word prediction.
This report summarizes the exploratory data analysis of the project.
The dataset contains files in four languages, sourced from blogs, news, and Twitter. We selected the en_US data and read it into R. A summary of the three files is given below.
| file_names | file_size (MB) | file_lines | num_of_char | num_of_words |
|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 206824505 | 37334131 |
| news | 196.2775 | 1010242 | 203223159 | 34372530 |
| twitter | 159.3641 | 2360148 | 162096241 | 30373583 |
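The columns of the summary table can be reproduced with a few lines of code. The report's analysis was done in R; the Python sketch below is only an illustration, not the author's code. The function name `summarize_file`, the whitespace-based word count, and the choice to exclude newline characters from the character count are all assumptions.

```python
import os

def summarize_file(path):
    """Return size (MB), line, character, and word counts for a text file.

    Assumptions: words are whitespace-separated tokens and the trailing
    newline of each line is not counted as a character.
    """
    size_mb = os.path.getsize(path) / 1024 ** 2
    n_lines = n_chars = n_words = 0
    # errors="ignore" skips bytes that are not valid UTF-8
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            text = line.rstrip("\n")
            n_lines += 1
            n_chars += len(text)
            n_words += len(text.split())
    return {
        "file_size": size_mb,
        "file_lines": n_lines,
        "num_of_char": n_chars,
        "num_of_words": n_words,
    }
```

Calling `summarize_file` once per source file (with whatever local paths the data was saved under) yields one table row per file. Exact counts depend on the tokenization rule, so they may differ slightly from the figures above.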
We sampled 10% of the lines from each file. The resulting sample covers 90% of the sample phrases.
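A 10% line sample can be drawn reproducibly by fixing a random seed. The following is a minimal Python sketch of that idea, assuming the lines are already loaded into a list; the function name `sample_lines` and the seed value are hypothetical, and the report itself may have sampled differently in R.

```python
import random

def sample_lines(lines, fraction=0.10, seed=1234):
    """Return a random subset of `lines`, keeping the original order.

    `fraction` is the share of lines to keep; `seed` (an arbitrary,
    assumed value) makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    k = round(len(lines) * fraction)
    # Sample indices without replacement, then restore document order
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]
```

Sampling indices instead of the lines themselves keeps duplicate lines distinct and preserves the order in which they appeared in the file.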
In the data, we also found that popular words include “said”, “just”, and “like”, along with contractions such as “im”, “ive”, and “dont”, which appear without their apostrophes.
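Forms such as “im” and “dont” are what one would expect if punctuation is stripped before counting. The Python sketch below illustrates that effect; it is not the report's R pipeline. The cleaning steps (lowercasing, removing everything except letters and whitespace, dropping stop words) and the tiny `STOPWORDS` set are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the report's list)
STOPWORDS = {"the", "a", "and", "to", "of", "i", "in", "it", "is", "that"}

def clean_tokens(text):
    """Lowercase, strip non-letters, split on whitespace, drop stop words."""
    text = text.lower()
    # Removing apostrophes turns "i'm" into "im" and "don't" into "dont"
    text = re.sub(r"[^a-z\s]", "", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def top_words(lines, n=10):
    """Return the n most frequent cleaned tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(clean_tokens(line))
    return counts.most_common(n)
```

For example, `clean_tokens("I'm sure I've said it")` yields `["im", "sure", "ive", "said"]`, which shows how the apostrophe-free contractions end up among the frequent words.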
Wordcloud
Relative word frequencies vary across the three files (blogs, news, twitter).
Distribution of each set of n-grams, based on relative frequency.
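The relative frequency of an n-gram is its count divided by the total number of n-grams of the same order. A minimal Python sketch of that calculation follows, assuming simple lowercase whitespace tokenization and n-grams that do not cross line boundaries; the function names are hypothetical and the report's own R implementation may differ.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequencies(lines, n):
    """Map each n-gram to its share of all n-grams found in `lines`.

    Assumption: tokens are lowercase whitespace-separated strings and
    n-grams are built within a line, never across two lines.
    """
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

Running `relative_frequencies` separately on the blogs, news, and twitter samples with `n` set to 1, 2, or 3 gives per-source distributions that can be compared directly, since each one sums to 1.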