Introduction
This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model.
The data was downloaded in an earlier exercise, so I will skip the download step.
I will now read in the data sets from the three sources: blogs, news, and Twitter.
We now examine the data sets and summarize, for each source, the file size, line count, word count, and mean number of words per line.
  source   file.size.MB  num.lines  num.words  mean.num.words
1 blogs        200.4240     899288   37546246        41.75108
2 news         196.2775    1010242   34762395        34.40997
3 twitter      159.3640    2360148   30093409        12.75064
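The per-source figures above can be reproduced along the following lines. This is a minimal Python sketch and not the code used for this report: the function names `summarize_lines` and `file_size_mb` are hypothetical, and a "word" is assumed to be any whitespace-separated token, which may give slightly different counts than the tokenizer behind the table.

```python
import os


def file_size_mb(path):
    """Size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / 1024 ** 2


def summarize_lines(source, lines):
    """Count lines and words for one source and derive mean words per line.

    Assumption: a word is any whitespace-separated token.
    """
    num_lines = 0
    num_words = 0
    for line in lines:
        num_lines += 1
        num_words += len(line.split())
    mean_words = num_words / num_lines if num_lines else 0.0
    return {
        "source": source,
        "num.lines": num_lines,
        "num.words": num_words,
        "mean.num.words": mean_words,
    }


# Tiny in-memory example; a real run would iterate over an open file handle.
demo = summarize_lines("demo", ["hello world", "one two three four"])
```

Passing an open file handle instead of a list keeps memory use low, since the lines are counted as they stream by rather than loaded all at once.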
Before any exploratory analysis, we must clean the data. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. To keep memory use manageable, we randomly sample 1% of the data to demonstrate the cleaning and exploratory analysis process.
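The sampling and cleaning steps described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the report's own code: `sample_lines` and `clean_line` are hypothetical names, the stopword list is a small stand-in for a full list, and each line is kept independently with probability 0.01, so the sample is approximately (not exactly) 1% of the data.

```python
import random
import re

# Small illustrative stopword list (assumption); a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}

URL_PATTERN = re.compile(r"(https?://|www\.)\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def clean_line(line):
    """Lower-case a line, then drop URLs, non-letters, stopwords and extra spaces."""
    text = line.lower()
    text = URL_PATTERN.sub(" ", text)
    # Replaces digits, punctuation and special characters with spaces.
    # Note: this also splits contractions, e.g. "don't" becomes "don t".
    text = NON_LETTER_PATTERN.sub(" ", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)  # single spaces only, no leading or trailing whitespace


cleaned = clean_line("Check http://t.co/abc The  CAT sat on 3 mats!")
```

URLs are removed before the non-letter filter so that their fragments (such as "http" or "co") do not survive as spurious tokens.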
With the data loaded and cleaned, we are ready to explore the sample. We list the most common unigrams, bigrams, and trigrams.
Histogram of unigrams in data sample
Histogram of bigrams in data sample
Histogram of trigrams in data sample
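The frequency counts behind histograms like these can be obtained with a simple sliding window over each cleaned line. The sketch below is a minimal Python illustration, not the report's implementation: `ngram_counts` is a hypothetical helper, the example lines are made up, and n-grams are assumed not to span line boundaries.

```python
from collections import Counter


def ngram_counts(cleaned_lines, n):
    """Count n-grams of length `n` across lines of space-separated tokens."""
    counts = Counter()
    for line in cleaned_lines:
        tokens = line.split()
        # Slide a window of n tokens along the line; short lines yield nothing.
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


# Made-up, already-cleaned example lines.
example = ["happy new year", "happy new year everyone", "new york city"]

unigrams = ngram_counts(example, 1)
bigrams = ngram_counts(example, 2)
trigrams = ngram_counts(example, 3)

# The most frequent entries are what a histogram would display.
top_bigrams = bigrams.most_common(2)
```

`Counter.most_common(k)` returns the `k` highest counts in descending order, which is a convenient input for a bar chart of the top n-grams.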