Introduction

This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large corpus of text documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model.

The data was already downloaded in an earlier exercise, so I will skip the download step.

Getting the data

I will now read in the data sets from the three data sources: blogs, news, and Twitter.

We will now examine each data set and summarize its file size, line count, word count, and mean number of words per line.

   source file.size.MB num.lines num.words mean.num.words
1   blogs     200.4240    899288  37546246       41.75108
2    news     196.2775   1010242  34762395       34.40997
3 twitter     159.3640   2360148  30093409       12.75064
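
The report's own code is not shown, so the following is only a minimal Python sketch of how a summary like the one above could be computed. The function name `summarize_file` and the dictionary keys are hypothetical, and words are counted by splitting on whitespace, which is an assumption and may not reproduce the word counts in the table exactly.

```python
import os

def summarize_file(path):
    """Return size in MB, line count, word count, and mean words per line.

    Words are counted by whitespace splitting (an assumption; the report's
    own tokenizer is not shown).
    """
    num_lines = 0
    num_words = 0
    # Stream the file line by line so large corpora are never fully in memory.
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            num_lines += 1
            num_words += len(line.split())
    return {
        "source": os.path.basename(path),
        "file_size_mb": os.path.getsize(path) / 1024 ** 2,
        "num_lines": num_lines,
        "num_words": num_words,
        "mean_num_words": num_words / num_lines if num_lines else 0.0,
    }
```

Calling `summarize_file` once per source file and collecting the results would give one row per source, in the same shape as the table.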

Cleaning data

Before any exploratory analysis, we must clean the data. Cleaning involves removing unnecessary words and characters, such as URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. To conserve memory, we randomly sample 1% of the data to demonstrate the cleaning and exploratory data analysis process.
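
The sampling and cleaning steps described above can be sketched roughly as follows. This is an illustrative Python sketch, not the report's actual code: the names `sample_lines` and `clean_line`, the seed, the regular expressions, and the tiny `STOPWORDS` set are all assumptions made for the example. A real analysis would use a full stopword list.

```python
import random
import re

# Tiny illustrative stopword list (an assumption, not the report's list).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
APOSTROPHE_RE = re.compile(r"['\u2019]")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

def clean_line(line):
    """Lower-case a line and strip URLs, punctuation, numbers, and stopwords."""
    text = line.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before punctuation
    text = APOSTROPHE_RE.sub("", text)   # "it's" -> "its", not "it s"
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)              # split/join collapses extra whitespace
```

For example, `clean_line("Check http://t.co/abc The 3 BIG   cats!!")` yields `"check big cats"`. URLs are removed first because stripping punctuation earlier would break them into ordinary-looking word fragments.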

Exploratory analysis

With the data loaded and cleaned, we are ready to explore the sample. We will list the most common unigrams, bigrams, and trigrams.
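
As a rough illustration of how such n-gram frequencies can be tallied, here is a minimal Python sketch. The function name `ngram_counts` is hypothetical, and the sketch assumes each input line is already cleaned and that its tokens are separated by whitespace. N-grams are counted within a line only, never across line boundaries.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count n-grams (as space-joined strings) within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Slide a window of width n over the tokens; lines shorter than n
        # produce an empty range and contribute nothing.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

Calling `ngram_counts(sample, 2).most_common(20)` on a list of cleaned lines would return the twenty most frequent bigrams with their counts, which is the kind of data a frequency histogram plots.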

Histogram of unigrams in data sample

Histogram of bigrams in data sample

Histogram of trigrams in data sample