Data Science Capstone

Milestone Report

Introduction

The goal of the Coursera Data Science Capstone project is to build a well-performing predictive text model. This Milestone Report summarizes progress toward that goal: exploring the data and laying the groundwork for a reasonable prediction algorithm.

Data Statistics

The dataset comes from a corpus called HC Corpora.

File	Size (bytes)	#Lines	#Words
en_US.blogs.txt	210,160,014	899,288	37,272,578
en_US.news.txt	205,811,889	1,010,242	34,309,642
en_US.twitter.txt	167,105,338	2,360,148	30,341,028
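
Statistics like those in the table can be computed with a short script. The following is a minimal sketch, not the code used for this report: the function name file_stats is hypothetical, and a "word" is assumed to be any whitespace-separated token, so the counts it produces may differ from the figures above.

```python
import os

def file_stats(path):
    """Return (size in bytes, number of lines, number of words) for a text file.

    Assumption: a word is any whitespace-separated token. The report does not
    state how its word counts were obtained.
    """
    size = os.path.getsize(path)
    n_lines = 0
    n_words = 0
    # Read line by line so that large files never have to fit in memory.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            n_lines += 1
            n_words += len(line.split())
    return size, n_lines, n_words
```

Calling file_stats("en_US.blogs.txt") would return the three values for one row of the table, assuming the file is in the working directory.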

Data Cleaning

Before tokenizing the corpora, we cleaned the data with the following transformations:

Removing numbers, punctuation and extra spaces.
Optionally removing profanity words.
Converting all letters into lowercase.
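
The cleaning steps above can be sketched as a single function. This is an illustration under stated assumptions, not the pipeline used for this report: the name clean_text and the PROFANITY set are hypothetical placeholders, and punctuation is deleted rather than replaced by a space (so "don't" becomes "dont"). Lowercasing is applied first so that the profanity filter matches regardless of case.

```python
import re

# Hypothetical placeholder list. The report does not say which profanity
# list was used, so substitute a real one as needed.
PROFANITY = {"badword"}

def clean_text(text, remove_profanity=False, profanity=PROFANITY):
    """Lowercase the text, strip digits and punctuation, collapse whitespace,
    and optionally drop words found in the profanity set."""
    text = text.lower()
    text = re.sub(r"\d", "", text)          # remove numbers
    text = re.sub(r"[^\w\s]|_", "", text)   # remove punctuation (deleted, not spaced)
    words = text.split()                    # splitting also collapses extra spaces
    if remove_profanity:
        words = [w for w in words if w not in profanity]
    return " ".join(words)
```

For example, clean_text("Hello,  World! 123 times") yields "hello world times".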

Data Analysis

Here we plot the most frequent n-grams of three orders (bigrams, trigrams, and quadgrams) to visualize the data:

Top 30 Bigrams	Top 30 Trigrams	Top 30 Quadgrams
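
The frequency tables behind plots like these can be built by counting n-grams over the cleaned lines. The sketch below is one possible approach and not the code used for this report: the function names ngrams and top_ngrams are hypothetical, tokens are assumed to be whitespace-separated, and n-grams are not allowed to span line boundaries.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(lines, n, k=30):
    """Count n-grams across all lines and return the k most frequent
    as (ngram, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Assumption: each line has already been cleaned and lowercased.
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)
```

Calling top_ngrams(lines, 2), top_ngrams(lines, 3), and top_ngrams(lines, 4) would give the data for the three panels. The same function covers higher orders by passing a larger n.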

Next Steps

Add higher-order n-grams: 5-grams and 6-grams.
Create a Shiny application.