Coursera Data Science Capstone Project

Introduction

Mobile office work is increasingly done on cellphones and tablets, where typing is slow, so an input system that saves typing time is needed. A smart and efficient keyboard is therefore required, and the core of such an input system is a predictive text model. This milestone report focuses on that model and covers the first stages, from data collection to exploratory analysis of the data set.

Data Collection

The data were downloaded from the course website (originally from HC Corpora) and unzipped to extract the English database as a corpus. It contains three text documents, drawn from Twitter, blogs and news, in which each line stands for one message.

Load Data

The encoding of the data set is assumed to be UTF-8.
Each file is loaded in turn using the readLines function.
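
The loading step might look like the following sketch in base R. The helper name load_corpus and the temporary demonstration file are assumptions made for illustration; the report only states that readLines is used with UTF-8 encoding. Opening the connection in binary mode and skipping embedded NUL characters is a common precaution with large text files, not something the report confirms.

```r
# Hypothetical helper: read one corpus file as UTF-8, one message per line.
load_corpus <- function(path) {
  con <- file(path, open = "rb")  # binary mode, so control characters do not end the read early
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file standing in for a corpus file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first message", "second message"), demo_path)
demo_lines <- load_corpus(demo_path)
length(demo_lines)
```

The same helper would be called once per document, with the path of the blogs, news or Twitter file in place of demo_path.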

Summary

The basic summary of the original data set is shown as follows:

Summary of the datasets
Dataset	Lines	Chars	Words
blogsdoc	899288	206824382	37570839
newsdoc	1010242	203223154	34494539
twittersdoc	2360148	162096241	30451170
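
A summary of this kind could be computed along the following lines. This is a minimal sketch in base R: the function name summarise_text is hypothetical, and counting words by splitting on whitespace is an assumption, since the report does not say how its counts were obtained. The figures it yields on the real files need not match the table exactly.

```r
# Hypothetical helper: count lines, characters and whitespace-separated words.
summarise_text <- function(lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    Lines = length(lines),
    Chars = sum(nchar(lines)),
    Words = sum(words_per_line)
  )
}

# Toy input standing in for one loaded document.
toy_summary <- summarise_text(c("hello world", "one two three"))
toy_summary
```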

Data Cleansing

The data will be cleaned by:

1) Sampling 1% of each of the three documents (3 files), for example sample(blogs_sample, length(blogs_sample) * 0.01).
2) Removing the non-ASCII characters.
3) Converting capital letters to lower case.
4) Removing the punctuation.
5) Removing numbers.
6) Removing stop words.
7) Stemming the remaining words.
8) Removing rare terms to decrease the sparsity of the term frequencies.
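
Steps 1 to 6 above can be sketched in base R as follows. The helper names, the seed and the short stop-word list are assumptions made for illustration and are not taken from the report. Stemming and sparse-term removal are left out of the sketch because they are usually delegated to add-on packages such as SnowballC and tm.

```r
set.seed(2016)  # hypothetical seed, only so that the sample is reproducible

# Hypothetical helper: draw a random fraction of the lines of one document.
sample_lines <- function(lines, fraction = 0.01) {
  sample(lines, size = ceiling(length(lines) * fraction))
}

# Hypothetical helper: apply cleaning steps 2 to 6 to a character vector.
clean_lines <- function(lines, stop_words) {
  x <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- tolower(x)                                            # lower case
  x <- gsub("[[:punct:]]+", " ", x)                          # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)                          # remove numbers
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, " ", x, perl = TRUE)               # remove stop words
  trimws(gsub("\\s+", " ", x))                               # tidy leftover whitespace
}

# Tiny illustrative stop-word list; a real run would use a fuller list.
demo_stop_words <- c("the", "in", "a", "of")
cleaned <- clean_lines("The caf\u00e9 opened in 2016!", demo_stop_words)
cleaned
```

Sampling comes first so that the later, more expensive steps only touch the 1% subset of each document.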

Tokenizer

Tokenization aims to remove meaningless characters and the words that occur with low frequency in the corpus. The final corpus keeps the words and n-grams with a high frequency, which helps in exploring the relationships between words and in building a meaningful statistical model.
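
One way to obtain n-gram frequencies from cleaned lines is sketched below in base R. The function name ngram_counts and the min_count threshold are hypothetical; the report does not name the tokenizer it uses, so this illustrates the idea and does not reproduce the original pipeline.

```r
# Hypothetical helper: count n-grams within each line and drop the rare ones.
ngram_counts <- function(lines, n, min_count = 1) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(tok) {
    tok <- tok[nzchar(tok)]                       # ignore empty tokens
    if (length(tok) < n) return(character(0))     # line too short for an n-gram
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  counts[counts >= min_count]                     # keep only frequent n-grams
}

# Toy input standing in for cleaned corpus lines.
bigrams <- ngram_counts(c("a b a b", "a b c"), n = 2)
bigrams
```

N-grams are built per line so that no n-gram spans two unrelated messages.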

Exploratory Analysis

Figure 1. Histogram of n-grams (top 10)

Figure 2. Word cloud of n-grams (top 10)
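
A top-10 frequency chart like the one named in Figure 1 could be drawn with base R graphics as sketched here. The counts are toy values invented purely for illustration and are not results from the corpus; the variable names are hypothetical as well.

```r
# Toy n-gram counts, invented for illustration only (not results from the corpus).
toy_counts <- c("of the" = 5, "in the" = 4, "to the" = 2, "on the" = 1)

# Keep at most the ten most frequent n-grams.
top_ngrams <- head(sort(toy_counts, decreasing = TRUE), 10)

# Bar chart of the selected n-grams, with labels turned to stay readable.
barplot(top_ngrams, las = 2, ylab = "Frequency",
        main = "Most frequent n-grams (toy data)")
```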

Interesting Findings

Scalability. The corpus feels like real big data when data of this scale has to be processed on a desktop PC.
Integrity. The data cleansing step is very important for obtaining accurate data.

Next Steps for the Prediction Application

As already noted, the next step of the capstone project is to create a prediction application. For the application to feel smooth and fast, a fast prediction algorithm is essential.
Thus, it is necessary to find ways to process larger data sets more quickly, and it is very important to evaluate which approaches are suitable. In this project the following algorithms and techniques will be evaluated:
Markov assumption (https://en.wikipedia.org/wiki/Markov_property)
Chain rule
Katz’s backoff model (https://thachtranerc.wordpress.com/2016/04/12/katzs-backoff-model-implementation-in-r/)
Smoothing techniques (http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)
All in all, a Shiny application will be created that predicts the next word a user wants to write.
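
To make the idea of backing off concrete, here is a deliberately simplified sketch in base R. It is not Katz's backoff model: it applies no discounting or smoothing. It only uses a first-order Markov assumption (the next word depends on the previous word alone) and falls back to the most frequent single word when the previous word was never seen. The function names and toy sentences are hypothetical.

```r
# Hypothetical helper: build unigram and bigram count tables from cleaned lines.
train_backoff <- function(lines) {
  tokens <- lapply(strsplit(lines, "\\s+"), function(tok) tok[nzchar(tok)])
  prev_word <- unlist(lapply(tokens, function(tok)
    if (length(tok) > 1) tok[-length(tok)] else character(0)))
  next_word <- unlist(lapply(tokens, function(tok)
    if (length(tok) > 1) tok[-1] else character(0)))
  list(
    unigrams = sort(table(unlist(tokens)), decreasing = TRUE),
    bigrams  = table(prev_word, next_word)
  )
}

# Hypothetical helper: predict the next word, backing off to the top unigram.
predict_next <- function(model, last_word) {
  if (last_word %in% rownames(model$bigrams)) {
    followers <- model$bigrams[last_word, , drop = FALSE]
    return(colnames(followers)[which.max(followers)])  # most frequent follower
  }
  names(model$unigrams)[1]                             # fallback: most frequent word
}

# Toy training sentences, invented for illustration only.
toy_model <- train_backoff(c("i love r", "i love data", "i like r"))
predict_next(toy_model, "i")
```

A fuller model would extend the lookup to longer contexts and redistribute probability mass to unseen n-grams, which is where Katz's backoff and the smoothing techniques listed above come in.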

Coursera Data Science Capstone Project - Milestone Report

Jack

October 19, 2016