Coursera Data Science Capstone Milestone Project

The purpose of this project is to create an application that predicts the next word from text the user has typed. For example, if the user types “Hello how are you”, the predicted word might be “today”. Such a data product has many applications, especially on mobile devices, where typing is cumbersome.

We are going to use three sources of data for this product:

Twitter Tweets
Blog Postings
News Articles

These data can be downloaded from:

(https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)

The following text describes the first steps toward creating such a product.

Basic Data Analysis

After downloading the data, we perform a basic analysis to get a first idea of the file sizes and the number of lines in each file. Sizes are in megabytes.

##             Size   Lines
## Blog    200.4242  899288
## News    196.2775 1010242
## Twitter 159.3641 2360148
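A summary like the one above can be computed with a few lines of code. The following is a minimal sketch in Python (not the R code used to produce this report); the function name `file_stats` and the file name in the usage comment are assumptions made for illustration.

```python
import os


def file_stats(path):
    """Return (size in megabytes, number of lines) for a text file."""
    size_mb = os.path.getsize(path) / (1024 ** 2)
    # Read in binary mode so that unusual characters cannot break the count.
    with open(path, "rb") as handle:
        lines = sum(1 for _ in handle)
    return size_mb, lines


# Hypothetical usage, assuming the archive was extracted locally:
# size_mb, lines = file_stats("en_US.blogs.txt")
```

Iterating over the file handle keeps memory use low, which matters for files of this size.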

Only English documents will be considered for this product. Looking at random lines from each document, we see that the Twitter data contains many non-English characters, which will be removed at a later stage.
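One simple way to remove such characters is to drop everything outside the ASCII range. This is a sketch in Python under that assumption, not necessarily the approach used for this report; `strip_non_ascii` is a hypothetical helper name.

```python
def strip_non_ascii(text):
    """Drop every character that is not plain ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Note that this approach is blunt: it also removes accented letters and typographic quotes that can occur in otherwise English text.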

Data Transformation

In this stage the documents are converted to a corpus and all the necessary transformations are performed. Stop words will not be removed, since they play a major role in prediction. We are going to apply the following transformations:

Convert to lower case
Remove extra whitespace
Remove punctuation
Remove numbers and currency symbols
Stem words
Remove sparse terms
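Most of the transformations listed above can be sketched with the Python standard library. The example below is an illustration only, not the pipeline used for this report: the function name `clean_text` and the set of currency symbols are assumptions, and stemming and sparse-term removal are left out because they need a stemmer and a term-document matrix respectively.

```python
import re
import string


def clean_text(text):
    """Lower-case the text and strip currency symbols, numbers,
    punctuation and surplus whitespace."""
    text = text.lower()
    text = re.sub(r"[$€£¥]", " ", text)  # currency symbols (assumed set)
    text = re.sub(r"\d+", " ", text)     # numbers
    # Delete ASCII punctuation; apostrophes go too, so "don't" becomes "dont".
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text
```

Whitespace is collapsed last so that the gaps left by the earlier removals do not survive in the result.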

We are also going to work with a small sample of the data, since the computations take a long time to complete and a sample is adequate at this stage.
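A reproducible random sample of lines could be drawn as in the Python sketch below. The report does not state the sample size, so the `fraction` argument and the default seed are hypothetical values chosen for illustration.

```python
import random


def sample_lines(lines, fraction, seed=2015):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [line for line in lines if rng.random() < fraction]
```

Because each line is kept or dropped independently, the function also works on a file handle without loading the whole file into memory first.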

After that, we create a term-document matrix. The 10 most frequent terms in this matrix are as follows:

##    word  freq
## 1   the 44168
## 2   and 23008
## 3  that  9488
## 4   for  9126
## 5  with  6703
## 6   you  6479
## 7   was  6010
## 8  this  4694
## 9  have  4470
## 10  but  4384
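A frequency table of this kind can be produced by counting words across the cleaned documents. The Python sketch below is an illustration, not the code behind the table above. The `min_length=3` default is an assumption, suggested by the fact that no word shorter than three letters appears in the table.

```python
from collections import Counter


def top_terms(documents, n=10, min_length=3):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in doc.split() if len(w) >= min_length)
    return counts.most_common(n)
```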

The following chart displays the 20 most frequent words for our sampled data:

Word cloud

The most common words appear larger in the picture.

Future Plan

To create the Shiny app, the following steps are required:

Learn more about n-grams and Markov chains
Implement the backoff algorithm
Remove profanity
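To make the n-gram and backoff idea concrete, here is a minimal Python sketch. It is a simplified illustration, not the planned implementation: it uses raw counts with no discounting or smoothing, so it is not a full Katz backoff model, and the names `build_ngram_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_order=3):
    """Map each context (a tuple of 0 to max_order - 1 preceding words)
    to a Counter of the words that followed it."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for k in range(max_order):
                if i - k < 0:
                    break  # not enough preceding words for this context
                tables[tuple(words[i - k:i])][word] += 1
    return tables


def predict_next(tables, text, max_order=3):
    """Try the longest context first, then back off to shorter ones."""
    words = text.split()
    for k in range(max_order - 1, -1, -1):
        if k > len(words):
            continue
        context = tuple(words[len(words) - k:])
        candidates = tables.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None


corpus = ["how are you today", "how are you doing", "how are you today"]
tables = build_ngram_tables(corpus)
print(predict_next(tables, "Hello how are you".lower()))  # prints "today"
```

The empty context `()` holds plain word counts, so when no longer context matches, the prediction falls back to the most frequent word overall.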

Milestone Report

KT

March, 2015
