The goal of the Coursera Data Science Capstone project is to build a well-performing predictive text model. This Milestone Report documents progress toward that goal: exploring the data and creating a reasonable prediction algorithm.
The dataset comes from a corpus called HC Corpora. The table below summarizes the three en_US files:
| File | Size (bytes) | #Lines | #Words |
|---|---|---|---|
| en_US.blogs.txt | 210,160,014 | 899,288 | 37,272,578 |
| en_US.news.txt | 205,811,889 | 1,010,242 | 34,309,642 |
| en_US.twitter.txt | 167,105,338 | 2,360,148 | 30,341,028 |
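Summary statistics like those in the table above could be gathered with a short script. The following is a minimal sketch, not the code used for this report: the function name `file_stats` is hypothetical, and it assumes UTF-8 input and counts words by splitting on whitespace, so its word counts may differ from the figures reported here.

```python
import os


def file_stats(path, encoding="utf-8"):
    """Return (size_in_bytes, line_count, word_count) for a text file.

    Assumption: a "word" is any whitespace-separated token.
    """
    size_bytes = os.path.getsize(path)
    n_lines = 0
    n_words = 0
    # Stream the file line by line so large corpora do not need to fit in memory.
    with open(path, "r", encoding=encoding, errors="ignore") as fh:
        for line in fh:
            n_lines += 1
            n_words += len(line.split())
    return size_bytes, n_lines, n_words
```

Calling `file_stats("en_US.blogs.txt")` would return one row of the table as a tuple, assuming the file is in the working directory.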
Before tokenizing the corpora, we cleaned the data with the following transformations:
Removing numbers, punctuation and extra spaces.
Optionally removing profane words.
Converting all letters into lowercase.
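The cleaning steps above can be sketched as a single function. This is an illustrative sketch under stated assumptions rather than the report's actual pipeline: the name `clean_text` and the `profanity` parameter are hypothetical, only ASCII punctuation is stripped, and lowercasing is applied first so that the profanity filter is case-insensitive.

```python
import re
import string

# Translation table that deletes ASCII punctuation (assumption: curly quotes
# and other non-ASCII punctuation are left untouched).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, profanity=None):
    """Lowercase text, drop numbers and punctuation, and collapse extra spaces.

    profanity: optional set of lowercase words to remove (hypothetical list).
    """
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = text.translate(_PUNCT_TABLE)   # remove punctuation
    tokens = text.split()                 # splitting also collapses extra spaces
    if profanity:
        tokens = [t for t in tokens if t not in profanity]
    return " ".join(tokens)
```

Note that removing punctuation also removes apostrophes, so a contraction such as "it's" becomes "its". Whether that is acceptable depends on how the prediction model treats contractions.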
The plots below show the 30 most frequent bigrams, trigrams, and quadgrams:
| Top 30 BiGram | Top 30 TriGram | Top 30 Quadgram |
|---|---|---|
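Frequency counts of the kind plotted above can be computed with a sliding window over the cleaned tokens. The sketch below is one possible approach, not necessarily the one used for this report: `ngrams` and `top_ngrams` are hypothetical names, tokens are split on whitespace, and n-grams are assumed not to cross line boundaries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a list of tokens."""
    # Zipping n shifted copies of the list produces a sliding window of width n.
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(lines, n, k=30):
    """Return the k most frequent n-grams across an iterable of cleaned lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)
```

For example, `top_ngrams(cleaned_lines, 2)` would give the top 30 bigrams with their counts, where `cleaned_lines` is assumed to be an iterable of already-cleaned strings. The same function handles 5-grams and 6-grams by changing `n`.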
Add higher-order n-grams, such as 5-grams and 6-grams.
Create a Shiny application.