Capstone: Milestone Report

"Binod Jung Bogati"
"6/20/2018"

Introduction

This is a milestone report for the data science capstone project, which analyzes the HC Corpora dataset.
The main goal of the project is to create a data product that predicts the next word.
This report summarizes the exploratory data analysis carried out for the project.

File Summary

The dataset contains files in four languages, sourced from blogs, news, and Twitter. We selected the en_US data and read it into R. A summary of the files is given below.

file_names	file_size_mb	file_lines	num_of_char	num_of_words
blogs	200.4242	899288	206824505	37334131
news	196.2775	1010242	203223159	34372530
twitter	159.3641	2360148	162096241	30373583
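
The columns in this table can be reproduced with a few lines of code. The sketch below is illustrative Python, not the R code used for the report; the function name `summarize_file` is hypothetical, the size is computed in megabytes (1024^2 bytes) on the assumption that this is the unit of the size column, and words are counted by splitting on whitespace, which may differ slightly from the counting method the report used.

```python
import os


def summarize_file(path):
    """Return size in MB, line count, character count, and word count for a text file."""
    n_lines = n_chars = n_words = 0
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            line = line.rstrip("\n")          # do not count the newline as a character
            n_lines += 1
            n_chars += len(line)
            n_words += len(line.split())      # assumption: words are whitespace-separated
    return {
        "file_size_mb": os.path.getsize(path) / 1024 ** 2,
        "file_lines": n_lines,
        "num_of_char": n_chars,
        "num_of_words": n_words,
    }
```

Calling `summarize_file` once per source file and collecting the results would give one row per file, in the same layout as the table.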

We sampled 10% of the lines from each file. This sample covers 90% of the sample phrases.
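
A 10% line sample can be drawn reproducibly by fixing a random seed. This is a minimal Python sketch rather than the report's R code; the function name `sample_lines` and the seed value are assumptions made for illustration.

```python
import random


def sample_lines(lines, fraction=0.10, seed=2018):
    """Return a reproducible random sample containing `fraction` of the lines."""
    rng = random.Random(seed)                 # hypothetical seed, fixed for reproducibility
    k = round(len(lines) * fraction)          # number of lines to keep
    return rng.sample(lines, k)               # sampling without replacement
```

Sampling each source file separately keeps the three sources represented in the same proportion as their line counts.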

Uni-grams

In the sampled data, popular words include “said”, “just”, and “like”, along with contracted forms such as “im”, “ive”, and “dont”.
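
Forms such as “im” and “dont” are what one would expect if punctuation is stripped before counting. The sketch below shows that effect in Python; it is an assumption about the cleaning step, not the report's actual R pipeline, and the names `tokenize` and `unigram_counts` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop everything except letters and whitespace, then split."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # removes apostrophes, so "don't" becomes "dont"
    return text.split()


def unigram_counts(lines):
    """Count how often each token occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts
```

With this cleaning, `unigram_counts(lines).most_common(20)` lists the most frequent uni-grams, which is the kind of ranking a wordcloud visualizes.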

Wordcloud

Figure: Uni-gram wordcloud

Uni-gram, by Source

Relative word frequencies vary across the three sources (blogs, news, Twitter).
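
Comparing sources fairly requires relative rather than raw counts, because the three files differ in size. The following Python sketch (illustrative only, with hypothetical names `relative_frequencies` and `frequency_by_source`) divides each word count by the total number of tokens in its source.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one source."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_by_source(tokens_by_source, words):
    """For each source, look up the relative frequency of the given words (0.0 if absent)."""
    table = {}
    for source, tokens in tokens_by_source.items():
        rel = relative_frequencies(tokens)
        table[source] = {word: rel.get(word, 0.0) for word in words}
    return table
```

Here `tokens_by_source` is assumed to be a dictionary such as `{"blogs": [...], "news": [...], "twitter": [...]}` holding the cleaned tokens of each sample.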

Figure: Uni-gram by source

Uni-gram Distribution

The plots in this section show the distribution of each set of n-grams (uni-grams through quad-grams), based on relative frequency.
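
An n-gram distribution of this kind can be computed by sliding a window of n words over the token sequence and normalizing the counts. This is a minimal Python sketch, not the report's R code; `ngrams` and `ngram_distribution` are hypothetical names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_distribution(tokens, n, top=10):
    """Return the `top` most frequent n-grams with their relative frequencies."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return [(" ".join(gram), count / total) for gram, count in counts.most_common(top)]
```

Calling `ngram_distribution(tokens, n)` with n set to 1, 2, 3, and 4 yields the data behind one bar chart per n-gram order.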

Figure: Uni-gram distribution

Bi-gram Distribution

Figure: Trigram distribution

Tri-gram Distribution

Figure: Quadgram distribution

Quad-gram Distribution

Prediction Model

We will create n-gram tables for bi-grams, tri-grams, and quad-grams to use for prediction.
When the user enters a word, the model finds the related words in these tables.
Based on the data, we will choose the n-gram model that gives the most favorable result.
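
One way such tables could drive prediction is a simple back-off lookup: try the longest context first and fall back to shorter ones when there is no match. The report does not specify a back-off scheme, so this Python sketch is an assumption for illustration only, and `build_tables` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict_next(tables, words, k=3):
    """Return up to k candidate next words, preferring the longest matching context."""
    for n in sorted(tables, reverse=True):    # quad-gram table first, then tri-gram, then bi-gram
        if len(words) < n - 1:
            continue                          # not enough input words for this context length
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []                                 # no table contains a matching context
```

For example, after `tables = build_tables(tokens)`, a call such as `predict_next(tables, ["i", "like"])` consults the tri-gram table first and only falls back to the bi-gram table if the two-word context was never seen.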