The goal of the Capstone Project is to provide an accurate, well-performing predictive text model for the Data Science Capstone Data Product. In this milestone report, we explore the major features of the data and briefly summarize the plan for creating the prediction algorithm.
We downloaded the data from Coursera. After unzipping the Coursera-SwiftKey.zip file, we found that it contains text in four languages: Russian, German, Finnish, and English. We use only the English data for training. The English data consists of three text files: blogs, news, and tweets. Some basic information about these files:
##      File Size (MB)   Lines    Words
## 1    News  196.2775 1010242 34503984
## 2   Blogs  200.4242  899288 37336707
## 3 Twitter  159.3641 2360148 30511885
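As an illustration of how figures like these can be obtained, here is a minimal Python sketch. It is not the code used for this report. The function name `file_stats` is hypothetical, and two assumptions are baked in: a megabyte is taken as 1024 × 1024 bytes, and a "word" is any whitespace-separated token. A different word definition would give different counts.

```python
import os


def file_stats(path, encoding="utf-8"):
    """Return (size_mb, line_count, word_count) for one text file.

    Assumptions: 1 MB = 1024 * 1024 bytes, and a word is any
    whitespace-separated token.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    lines = 0
    words = 0
    # Stream the file line by line so large files are never fully in memory.
    with open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return size_mb, lines, words
```

Calling `file_stats` once per file and collecting the three returned values gives one row of the table per file.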
Because these files are large, we drew a sample of the data for the exploration and cleaned it using the tm package.
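The sampling and cleaning steps can be sketched roughly as follows. This is a Python illustration only, not the tm pipeline used here. The report does not list which transformations were applied, so lowercasing, removing digits and punctuation, and collapsing whitespace are assumptions based on typical text cleaning. The names `sample_lines` and `clean_line` and the fixed seed are hypothetical.

```python
import random
import re


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def clean_line(text):
    """Lowercase, drop digits and punctuation, collapse whitespace.

    Assumed cleaning steps; apostrophes are kept so that
    contractions such as "it's" stay intact.
    """
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)     # remove numbers
    text = re.sub(r"[^a-z' ]+", " ", text)  # remove everything but letters, apostrophes, spaces
    return re.sub(r"\s+", " ", text).strip()
```

Sampling line by line, rather than taking the first lines of a file, avoids biasing the sample toward whatever happens to come first in each source.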
We then tokenized the sample, built n-grams, and carried out some exploratory data analysis.
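To make the n-gram step concrete, the following Python sketch counts n-grams of a given order over cleaned lines. It is an illustration under stated assumptions, not this report's code: tokens are taken to be whitespace-separated words, n-grams do not cross line boundaries, and the function names `ngrams` and `count_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams over an iterable of cleaned lines.

    Each line is tokenized on whitespace, and n-grams never span two lines.
    """
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts


# Example: the most frequent bigrams in a tiny corpus.
bigram_counts = count_ngrams(["the cat sat", "the cat ran"], 2)
top_bigrams = bigram_counts.most_common(1)
```

Running `count_ngrams` with `n` set to 1, 2, 3, and 4 yields the unigram, bigram, trigram, and quadgram frequency tables that such an exploration is based on.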
Unigrams
Bigrams
Trigrams
Quadgrams
Further research into the relationships between words is still needed. The plan is to build a basic n-gram prediction model, choose suitable algorithms for it, and then use the model as the basis for a Shiny application.
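The algorithm has not been chosen yet, so the following Python sketch shows only one possible starting point: a plain backoff lookup that tries the longest available context first and falls back to shorter ones. There is no smoothing or discounting, and the names `build_model` and `predict_next` as well as the default order of 4 are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=4):
    """Map each context (0 to max_n - 1 preceding words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break  # not enough preceding words for a context of length k
                context = tuple(tokens[i - k:i])
                model[context][word] += 1
    return model


def predict_next(model, text, max_n=4, top=3):
    """Suggest up to `top` next words, backing off to shorter contexts."""
    tokens = text.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            return [word for word, _ in model[context].most_common(top)]
    return []  # empty model: nothing to suggest


corpus = ["i love data science", "i love data analysis",
          "i love data science projects"]
model = build_model(corpus)
suggestions = predict_next(model, "I love data")
```

With an empty context the lookup reduces to the most frequent single words, so any non-empty model always returns some suggestion. A smoothed variant would be a natural refinement once the basic model has been evaluated.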