Capstone: Milestone Report

Binod Jung Bogati
6/20/2018

Introduction

  • This is a milestone report for the data science capstone project, which analyzes the HC Corpora dataset.

  • The main goal of the project is to create a data product that performs word prediction.

  • This report summarizes the exploratory data analysis of the project.

File Summary

The dataset contains files in four languages, sourced from blogs, news, and Twitter. We selected the en_US files and read them into R. A summary of each file is given below.

file_names  file_size (MB)  file_lines  num_of_char  num_of_words
blogs       200.4242        899288      206824505    37334131
news        196.2775        1010242     203223159    34372530
twitter     159.3641        2360148     162096241    30373583
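A summary like the one above can be computed in base R. The following is a minimal sketch, not the report's actual code: the function name `summarize_file`, the whitespace-based word count, and the temporary demo file are all assumptions made for illustration, so the word counts it produces on the real corpus may differ from the figures in the table.

```r
# Summarize one text file: size in MB, line count, character count, word count.
# Hypothetical helper; the report's own counting method is not shown.
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Split each line on runs of whitespace and count the non-empty tokens.
  tokens <- strsplit(trimws(lines), "\\s+")
  data.frame(
    file_size_mb = file.info(path)$size / 1024^2,
    file_lines   = length(lines),
    num_of_char  = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    num_of_words = sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  )
}

# Demo on a small temporary file so the sketch runs without the corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", ""), tmp)
demo_summary <- summarize_file(tmp)
print(demo_summary)
```

To summarize the three en_US files, the same function could be applied to each path and the rows combined with `rbind`.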

We sampled 10% of the lines from each file. This sample covers 90% of the sample phrases.
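The sampling step can be sketched as follows. The function name `sample_lines` and the seed value are assumptions; the report does not state how its sample was drawn or whether a seed was fixed.

```r
# Draw a reproducible random sample of a fraction of the lines.
# Hypothetical helper; seed and rounding rule are illustrative choices.
sample_lines <- function(lines, fraction = 0.10, seed = 2018) {
  set.seed(seed)
  n_keep <- round(length(lines) * fraction)
  # Sort the sampled indices so the kept lines stay in their original order.
  lines[sort(sample.int(length(lines), n_keep))]
}

# Demo on 50 synthetic lines: 10% of 50 keeps 5 lines.
demo_lines <- paste("line", 1:50)
demo_sample <- sample_lines(demo_lines)
print(demo_sample)
```

Fixing the seed makes the sample reproducible across runs, which helps when comparing n-gram tables built at different times.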

Uni-grams

In the sampled data, popular words include “said”, “just”, and “like”, along with contractions written without apostrophes, such as “im”, “ive”, and “dont”.
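Forms such as “im” and “dont” are what one would expect if punctuation is stripped before tokenizing. The sketch below assumes that kind of cleaning (lowercasing, then removing every character other than letters and spaces); the report's actual cleaning steps are not shown, and `count_unigrams` is a hypothetical name.

```r
# Count uni-grams after a simple cleaning pass.
# Assumed cleaning: lowercase, then drop everything except a-z and spaces.
count_unigrams <- function(lines) {
  clean <- tolower(lines)
  # Removing punctuation also removes apostrophes: "don't" becomes "dont".
  clean <- gsub("[^a-z ]", "", clean)
  words <- unlist(strsplit(clean, " +"))
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

demo_counts <- count_unigrams(c("I'm just saying, don't!", "Just like I said."))
print(head(demo_counts, 3))
```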

Wordcloud

[Figure: Uni-gram wordcloud]

Uni-gram, by Source

Relative word frequencies vary across the three sources (blogs, news, twitter).

[Figure: Uni-gram frequencies by source]

Uni-gram Distribution

The plots in this and the following sections show the distribution of each set of n-grams, based on relative frequency.

[Figure: Uni-gram distribution]
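Relative frequency here means an n-gram's count divided by the total number of n-grams of that order. A minimal base R sketch, with hypothetical helper names `make_ngrams` and `relative_frequency` and a toy word vector in place of the corpus:

```r
# Build all n-grams of order n from a vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Relative frequency: each n-gram's count divided by the total count.
relative_frequency <- function(ngrams) {
  counts <- sort(table(ngrams), decreasing = TRUE)
  counts / sum(counts)
}

demo_words <- c("i", "like", "tea", "and", "i", "like", "coffee")
demo_bigram_freq <- relative_frequency(make_ngrams(demo_words, 2))
print(demo_bigram_freq)
```

The same two functions work for any order by changing `n`, which is how one table per n-gram size could be produced.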

Bi-gram Distribution

[Figure: Bi-gram distribution]

Tri-gram Distribution

[Figure: Tri-gram distribution]

Quad-gram Distribution

[Figure: Quad-gram distribution]

Prediction Model

  • Create n-gram tables for bi-grams, tri-grams, and quad-grams to use for prediction.
  • When the user inputs a word, the model finds the related words.
  • Based on the data, we choose the n-gram model that gives the most favorable result.
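One way the plan above could be realized is a simple back-off lookup: try the longest available context first and fall back to shorter n-grams when there is no match. This is only a sketch under that assumption, not the model the report will necessarily use; `build_table`, `predict_next`, and the toy word vector are hypothetical.

```r
# Build a lookup table for order n: context (first n-1 words), next word, count.
build_table <- function(words, n) {
  if (length(words) < n) {
    return(data.frame(context = character(0), nxt = character(0),
                      count = integer(0), stringsAsFactors = FALSE))
  }
  starts <- seq_len(length(words) - n + 1)
  context <- vapply(starts,
                    function(i) paste(words[i:(i + n - 2)], collapse = " "),
                    character(1))
  nxt <- words[starts + n - 1]
  pairs <- data.frame(context, nxt, count = 1L, stringsAsFactors = FALSE)
  # Sum the occurrences of each (context, next word) pair.
  aggregate(count ~ context + nxt, data = pairs, FUN = sum)
}

# Predict the next word, backing off from the longest context to the shortest.
predict_next <- function(tables, input, top = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in sort(as.integer(names(tables)), decreasing = TRUE)) {
    if (length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    hits <- tables[[as.character(n)]]
    hits <- hits[hits$context == context, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count), ]
      return(head(as.character(hits$nxt), top))
    }
  }
  character(0)  # no n-gram table contains this context
}

# Toy corpus standing in for the cleaned sample.
demo_words <- c("i", "like", "tea", "and", "i", "like", "tea",
                "and", "i", "like", "coffee")
tables <- lapply(c("2" = 2, "3" = 3, "4" = 4),
                 function(n) build_table(demo_words, n))
demo_prediction <- predict_next(tables, "I like")
print(demo_prediction)
```

In the toy data, "i like" is followed by "tea" twice and "coffee" once, so the tri-gram table ranks "tea" first. On the real corpus, the tables would likely need pruning of rare n-grams to keep lookups fast.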