Milestone Report

Introduction

The goal of the Coursera Data Science Capstone project is to build a well-performing predictive text model. This Milestone Report is a progress report on exploring the data and creating a reasonably accurate prediction algorithm.

Data Statistics

The dataset comes from a corpus called HC Corpora.

File              Size (bytes)     #Lines      #Words
en_US.blogs.txt    210,160,014    899,288  37,272,578
en_US.news.txt     205,811,889  1,010,242  34,309,642
en_US.twitter.txt  167,105,338  2,360,148  30,341,028
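
For readers who want to reproduce figures like these, the following is a minimal Python sketch, not the method used for this report. The function name `file_stats` is hypothetical, and words are counted by splitting on whitespace, so the results may differ slightly from the numbers above depending on the tokenizer used.

```python
import os


def file_stats(path):
    """Return (size in bytes, line count, word count) for a text file.

    Assumption: a "word" is any whitespace-separated token.
    """
    size = os.path.getsize(path)
    lines = 0
    words = 0
    # Read line by line so that large corpus files are never fully in memory.
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            lines += 1
            words += len(line.split())
    return size, lines, words
```

Calling `file_stats("en_US.blogs.txt")` on a local copy of the corpus file would return the three values for that file.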

Data Cleaning

Before tokenizing the corpora, we cleaned the data by applying the following transformations:

  1. Removing numbers, punctuation and extra spaces.

  2. Optionally removing profanity.

  3. Converting all letters into lowercase.
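
The transformations above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for this report: `clean_text` and `PROFANITY` are hypothetical names, the profanity list is a placeholder, and punctuation is deleted rather than replaced by a space.

```python
import re

# Hypothetical placeholder: the report does not say which profanity list was used.
PROFANITY = {"badword"}


def clean_text(text, remove_profanity=False, profanity=PROFANITY):
    """Apply the cleaning steps to a single string."""
    # Step 1: remove numbers and punctuation (underscores are treated as punctuation).
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]|_", "", text)
    # Step 3: convert to lowercase. This is done before the profanity filter
    # so that the filter matches regardless of case.
    text = text.lower()
    words = text.split()
    # Step 2 (optional): drop words found in the profanity list.
    if remove_profanity:
        words = [w for w in words if w not in profanity]
    # Joining on a single space also removes extra spaces.
    return " ".join(words)
```

For example, `clean_text("Hello, World! 123  It's")` yields `hello world its`.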

Data Analysis

Here we plot the 30 most frequent bigrams, trigrams, and quadgrams to visualize the data:

[Figures: Top 30 Bigrams, Top 30 Trigrams, Top 30 Quadgrams]
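
As a rough illustration of how such frequency rankings can be computed, here is a minimal Python sketch. It is not the code behind the plots: `top_ngrams` is a hypothetical name, tokens are assumed to be whitespace-separated words from already cleaned text, and n-grams are not allowed to span line boundaries.

```python
from collections import Counter


def top_ngrams(lines, n, k=30):
    """Return the k most frequent n-grams as (ngram, count) pairs.

    Assumption: each element of `lines` is a cleaned string whose
    tokens are separated by whitespace.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Slide a window of n tokens across the line.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)
```

Calling `top_ngrams(lines, 2)`, `top_ngrams(lines, 3)`, and `top_ngrams(lines, 4)` gives the data for the bigram, trigram, and quadgram rankings respectively.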

Next Steps

  1. Add higher-order n-grams: 5-grams and 6-grams.

  2. Create a Shiny application.