Introduction
This is the Milestone Report for the Coursera Data Science Capstone Project. The project involves building a predictive model of English text, a task within Natural Language Processing (NLP) and Text Mining.
The Milestone Report is a deliverable of Week 2 (Exploratory Data Analysis and Modeling). Its primary aim is to demonstrate the ability to work with the data (the three .txt files named ‘blogs’, ‘news’ and ‘twitter’) and to show that the work is on track toward the prediction algorithm.
The analysis in this report is displayed using:
- Dataset Comparison Tables
- Barcharts showing Most Frequently Occurring Words in each n-gram
- Interactive Wordcloud showing Most Frequently Occurring Words in Trigram (with the count of each phrase displayed on mouse hover)
- Static Wordcloud showing Most Frequently Occurring Words for the other two - Unigram and Bigram
Data Source
The training data for this study consists of the following .txt files, stored in a subdirectory of the project. The model will be trained on this collection.
- Blog: en_US.blogs.txt
- News: en_US.news.txt
- Twitter: en_US.twitter.txt
The source data is provided by SwiftKey; the download link is available on the Coursera course site.
Load Libraries and Data
The relevant data were loaded from the respective text files: blogs, news and twitter. All requisite runtime libraries were also loaded.
The blogs data file was loaded first.
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
The news data file was loaded next.
## chr [1:1010242] "He wasn't home alone, apparently." ...
The twitter data file was loaded last.
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
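For reference, a minimal sketch of how such files are typically read into character vectors (the file paths are assumptions based on the file names listed above):

```r
# Read each file line by line; skipNul guards against embedded null characters
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

str(blogs)  # produces output like the chr [1:899288] line shown above
```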
Overview of Datasets
Main Dataset Comparison Statistics
The key information for each of the datasets (blogs, news and twitter) is summarized below, followed by a sketch of how such statistics can be computed:
**The Main Datasets**

| Dataset | Longest line (chars) | Size in memory | File size (MB) | Lines | Non-empty lines | Characters | Non-white characters | Words |
|---------|---------------------:|---------------:|---------------:|------:|----------------:|-----------:|---------------------:|------:|
| blogs   | 40833 | 248.5 Mb | 200.4242 | 899288  | 899288  | 206824382 | 170389539 | 37570839 |
| news    | 11384 | 249.6 Mb | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| twitter | 140   | 301.4 Mb | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
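The statistics above can be reproduced with base R and the stringi package; the sketch below assumes the files are loaded as shown earlier (the table's column names are inferred from this kind of output):

```r
library(stringi)

# Lines, non-empty lines, characters and non-white characters, e.g. for blogs
stri_stats_general(blogs)

sum(stri_count_words(blogs))                 # total word count
max(nchar(blogs))                            # longest line in characters
format(object.size(blogs), units = "Mb")     # size of the object in memory
file.size("en_US.blogs.txt") / 1024^2        # file size on disk in MB
```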
Data Subsets Comparison Statistics
Subsets of the main data files were created to keep processing manageable while allowing direct comparison. The key information is summarized below (see the sketch after the table):
**The Main Datasets and Sub-Datasets**

| Dataset | Size in memory | Lines | Characters | Longest line (chars) |
|---------|---------------:|------:|-----------:|---------------------:|
| blogs | 248.5 Mb | 899288 | 206824505 | 40833 |
| news | 249.6 Mb | 1010242 | 203223159 | 11384 |
| twitter | 301.4 Mb | 2360148 | 162096241 | 140 |
| Blogs_subset | 0.5 Mb | 1798 | 402996 | 2751 |
| News_subset | 0.5 Mb | 2020 | 408182 | 983 |
| twitter_subset | 0.6 Mb | 4720 | 325001 | 140 |
| subset_blog_news_twitter | 1.6 Mb | 8538 | 1148667 | 2209 |
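A sketch of how such subsets might be drawn; random sampling of roughly 0.2% of the lines reproduces the subset line counts in the table, but the exact method and seed are assumptions:

```r
set.seed(1234)  # seed value assumed, for reproducibility

# Sample roughly 0.2% of the lines from each dataset
Blogs_subset   <- sample(blogs,   floor(length(blogs)   * 0.002))
News_subset    <- sample(news,    floor(length(news)    * 0.002))
twitter_subset <- sample(twitter, floor(length(twitter) * 0.002))

# Combine the three subsets into a single vector for the corpus
subset_blog_news_twitter <- c(Blogs_subset, News_subset, twitter_subset)
```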
Corpus Processing
Initial Data Cleanup
A corpus was created from the subsets and the data clean-up activities outlined below were applied (a sketch follows the list):
- Convert all words to lowercase
- Eliminate punctuation
- Eliminate numbers
- Strip whitespace
- Create Plain Text Format
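A minimal sketch of this clean-up pipeline using the tm package; the corpus construction and variable names are assumptions, not the report's actual code:

```r
library(tm)

# Build a corpus from the combined subset (variable name assumed)
corpus <- VCorpus(VectorSource(subset_blog_news_twitter))

# Apply the clean-up steps listed above, in order
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
corpus <- tm_map(corpus, removePunctuation)             # eliminate punctuation
corpus <- tm_map(corpus, removeNumbers)                 # eliminate numbers
corpus <- tm_map(corpus, stripWhitespace)               # strip whitespace
corpus <- tm_map(corpus, PlainTextDocument)             # plain text format
```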
Tokenize
Breaking a Stream of Text into Words or Short Phrases
The next step was to tokenize the samples and construct matrices of Unigrams, Bigrams and Trigrams, converting the clean dataset to a format usable for Natural Language Processing (NLP). A sketch of the tokenization is shown below.
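One common way to build these token matrices is to feed RWeka n-gram tokenizers into tm's TermDocumentMatrix; the sketch below assumes the cleaned corpus from the previous step and is an illustration rather than the exact code used:

```r
library(RWeka)

# Tokenizers for two- and three-word phrases (unigrams use the default tokenizer)
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigram_tdm <- TermDocumentMatrix(corpus)
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
```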
**One word**

| Term | Frequency |
|------|----------:|
| ability | 16 |
| able | 59 |
| about | 559 |
| above | 24 |
| absolutely | 24 |
| accept | 13 |
**Two words**

| Term | Frequency |
|------|----------:|
| a better | 18 |
| a big | 34 |
| a bit | 42 |
| a car | 15 |
| a chance | 23 |
| a couple | 31 |
**Three words**

| Term | Frequency |
|------|----------:|
| a chance to | 15 |
| a couple of | 26 |
| a little bit | 16 |
| a lot of | 60 |
| according to the | 12 |
| all of the | 14 |
Calculate Frequencies of N-Grams
Frequency of Occurrence of Words or Short Phrases
Next, the most frequently occurring words in the data were identified and plotted in charts representing the unigrams, bigrams and trigrams.
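A sketch of how such frequencies can be computed and charted with ggplot2; the get_freq helper and matrix names are assumptions carried over from the tokenization sketch above:

```r
library(ggplot2)

# Assumed helper: collapse a term-document matrix into a sorted frequency table
get_freq <- function(tdm, top_n = 20) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(data.frame(term = names(freq), freq = freq, row.names = NULL), top_n)
}

uni_freq <- get_freq(unigram_tdm)
ggplot(uni_freq, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Most Frequent Unigrams")
```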

Wordclouds
Alternative Visualization of the Main Words
As an alternative to the bar charts, wordclouds give a quick visual impression of the most common words and phrases in the corpus.
First is an interactive wordcloud for the Trigram token (hovering the mouse over a phrase shows the number of times it occurs in the token).
Most Frequent Words in Trigram Token
Next are static wordclouds for the other two Tokens - Unigram and Bigram.
Most Frequent Words in Unigram and Bigram Tokens
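A minimal sketch of both wordcloud styles, assuming the wordcloud2 package for the interactive version and the wordcloud package for the static ones (the frequency tables reuse the assumed get_freq helper from above):

```r
library(wordcloud)
library(wordcloud2)
library(RColorBrewer)

# Interactive trigram wordcloud: wordcloud2 shows the count on mouse hover
tri_freq <- get_freq(trigram_tdm, top_n = 100)
wordcloud2(tri_freq)

# Static wordcloud, e.g. for the unigram token
uni_freq <- get_freq(unigram_tdm, top_n = 100)
wordcloud(words = uni_freq$term, freq = uni_freq$freq,
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```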

Overall, the total time taken for the entire processing was calculated as given below:
## [1] "Total Processing Time: 4 minutes"
Next Steps
The next steps will be to:
- build a predictive model that employs an n-gram model with a frequency lookup similar to the one used in this report (see the sketch after this list).
- combine everything and deploy it as a Shiny app that recommends the likely next word after a phrase is typed.
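As a rough illustration of the frequency-lookup idea, a trigram table can be used to suggest the next word after a two-word prefix; this is purely hypothetical code, not the final model:

```r
# Hypothetical next-word lookup from a trigram frequency table
# (tri_freq assumed to have columns term and freq, as sketched above)
predict_next <- function(prefix, tri_freq) {
  # Keep trigrams whose first two words match the typed prefix
  hits <- tri_freq[startsWith(tri_freq$term, paste0(tolower(prefix), " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  # Return the last word of the most frequent matching trigram
  best <- hits$term[which.max(hits$freq)]
  tail(strsplit(best, " ")[[1]], 1)
}

predict_next("a couple", tri_freq)  # likely "of", given the counts above
```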