Data Science Capstone: Milestone Report

Introduction

The goal of the Capstone Project is to build an accurate, well-performing predictive text model for the Data Science Capstone data product. In this milestone report, we explore the major features of the data and briefly summarize the plan for creating the prediction algorithm.

Data Description

We downloaded the data from Coursera. After unzipping the Coursera-SwiftKey.zip file, we found that it contains text in four languages: Russian, German, Finnish, and English. We use only the English data for training. The English data consists of three text files: blogs, news, and tweets. Basic information about these files is summarized below:

##      File Size (MB)   Lines    Words
## 1    News  196.2775 1010242 34503984
## 2   Blogs  200.4242  899288 37336707
## 3 Twitter  159.3641 2360148 30511885
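A summary like the one above can be produced with a single pass over each file. The following is a minimal sketch in Python, given for illustration only; it is not the code used to produce this report. The function name `file_stats` is hypothetical, and counting whitespace-delimited tokens as words is an assumption, so the totals may differ from those reported above.

```python
import os


def file_stats(path):
    """Return (size in MB, line count, word count) for a text file.

    Assumption: a "word" is any whitespace-delimited token.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    lines = 0
    words = 0
    # Read line by line so that large files never need to fit in memory.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return size_mb, lines, words
```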

Data Sampling and Cleaning

Because these files are large, we worked with a sample of the data for the exploration, and we cleaned the sample using the tm package.
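The sampling and cleaning step can be sketched as follows. This is an illustrative Python version, not the tm-based code used for the report. The function names, the fixed seed, and the specific cleaning steps (lowercasing, removing digits and punctuation, collapsing whitespace) are all assumptions, since the report does not list which transformations were applied.

```python
import random
import re


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def clean_line(line):
    """Lowercase, drop digits and punctuation, and collapse whitespace."""
    line = line.lower()
    # Assumed rule: keep only letters, apostrophes, and spaces.
    line = re.sub(r"[^a-z' ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()
```

Sampling each line independently avoids loading a full file just to draw a fixed-size sample, at the cost of a sample size that varies slightly from run to run when the seed changes.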

N-grams

After tokenizing the data, we built n-grams and performed some exploratory data analysis on them.
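To make the n-gram construction concrete, here is a minimal Python sketch, given for illustration and not taken from the report's own code. The names `ngrams` and `ngram_counts` are hypothetical, and tokenizing on whitespace and not letting n-grams span line boundaries are both assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(lines, n):
    """Count n-grams across lines without crossing line boundaries."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts
```

Calling `ngram_counts(lines, n).most_common(20)` with `n` set to 1, 2, 3, or 4 would give the most frequent unigrams, bigrams, trigrams, or quadgrams in a sample.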

Unigrams

Bigrams

Trigrams

Quadgrams

Next Steps

Further research is needed on the relationships between words. We plan to build a basic n-gram prediction model and choose suitable algorithms, which should lead to a working Shiny application.
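As one possible starting point for such a model, the sketch below implements a simple back-off lookup in Python: it tries the longest available context first and falls back to shorter ones. This is an assumption about what a basic n-gram predictor could look like, not the algorithm chosen for the project. It applies no smoothing or discounting, and the names `build_model` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context tuple (0 to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            for k in range(max_n):  # k is the context length
                if i - k >= 0:
                    model[tuple(tokens[i - k:i])][word] += 1
    return model


def predict(model, text, max_n=3):
    """Return the most frequent next word, backing off to shorter contexts."""
    tokens = text.split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue  # not enough words for a context of this length
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty
```

With `max_n=3`, the lookup uses the last two words when that pair was seen in training, then the last word alone, and finally the overall most frequent word.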