Introduction

This is the Milestone Report for the Coursera Data Science Capstone Project. The project involves building a predictive model of English text, a task in Natural Language Processing (NLP) and Text Mining.

The Milestone Report is a deliverable of Week 2 (Exploratory Data Analysis and Modeling). Its primary aim is to demonstrate the ability to work with the data (the three .txt files named ‘blogs’, ‘news’ and ‘twitter’) and to show that the project is on track toward creating the prediction algorithm.

The analysis in this report is presented using summary tables, n-gram frequency charts, and wordclouds.

Data Source

The training data for this study consist of the three .txt files described below, stored in the project subdirectory. The model will be trained on this collection.

The source data are provided by SwiftKey; the download link is available on the Coursera course page.

Load Libraries and Data

The relevant data were loaded from the respective text files (blogs, news, and twitter), along with all requisite runtime libraries; a sketch of the loading code follows the excerpts below.

The blogs data file was loaded first.

##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...

The news data file was loaded next.

##  chr [1:1010242] "He wasn't home alone, apparently." ...

The twitter data file was loaded last.

##  chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
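As a minimal sketch of how this loading might be done (file paths are assumed to follow the standard Coursera-SwiftKey layout):

```r
library(tm)         # corpus creation and cleaning
library(RWeka)      # n-gram tokenization
library(wordcloud)  # wordcloud visualization

# Read each file as a character vector, one element per line;
# skipNul avoids warnings from embedded NUL characters
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```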

Overview of Datasets

Main Dataset Comparison Statistics

The key statistics for each of the datasets (blogs, news, and twitter) are summarized below:

**The Main Datasets**

| FileName | MaxCharacters | Size in Memory | Size on Disk (MB) | Lines | Non-empty Lines | Characters | Characters (no whitespace) | WordCount |
|---|---|---|---|---|---|---|---|---|
| blogs | 40833 | 248.5 Mb | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| news | 11384 | 249.6 Mb | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| twitter | 140 | 301.4 Mb | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
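These statistics can be computed with base R plus the stringi package; a sketch for one file (the same pattern applies to the others):

```r
library(stringi)

# Illustrative statistics for the blogs file
size_disk_mb <- file.size("final/en_US/en_US.blogs.txt") / 1024^2  # size on disk
size_memory  <- format(object.size(blogs), units = "Mb")           # size in memory
n_lines      <- length(blogs)
max_chars    <- max(nchar(blogs))
n_chars      <- sum(nchar(blogs))
n_words      <- sum(stri_count_words(blogs))
```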

Data Subsets Comparison Statistics

Subsets of the main data files were created for faster processing and easier comparison; their key statistics are summarized below alongside the full datasets:

**The Main Datasets and Sub-Datasets**

| File | Size in Memory | Entries | Total Characters | Max Characters |
|---|---|---|---|---|
| blogs | 248.5 Mb | 899288 | 206824505 | 40833 |
| news | 249.6 Mb | 1010242 | 203223159 | 11384 |
| twitter | 301.4 Mb | 2360148 | 162096241 | 140 |
| Blogs_subset | 0.5 Mb | 1798 | 402996 | 2751 |
| News_subset | 0.5 Mb | 2020 | 408182 | 983 |
| twitter_subset | 0.6 Mb | 4720 | 325001 | 140 |
| subset_blog_news_twitter | 1.6 Mb | 8538 | 1148667 | 2209 |
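The subset sizes above correspond to roughly 0.2% of the lines in each file, so the sampling may have looked like the following sketch (the seed and fraction are assumptions):

```r
set.seed(1234)       # assumed seed, for reproducibility
sample_frac <- 0.002 # ~0.2% of lines, consistent with the entry counts above

blogs_subset   <- sample(blogs,   floor(length(blogs)   * sample_frac))
news_subset    <- sample(news,    floor(length(news)    * sample_frac))
twitter_subset <- sample(twitter, floor(length(twitter) * sample_frac))

# Combined subset used to build the corpus in the next section
subset_blog_news_twitter <- c(blogs_subset, news_subset, twitter_subset)
```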

Corpus Processing

Initial Data Cleanup

A corpus was created from the combined subset, and the following clean-up steps were applied (see the sketch after this list):

  • Convert all words to lowercase
  • Eliminate punctuation
  • Eliminate numbers
  • Strip whitespace
  • Create Plain Text Format
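A minimal sketch of these steps using the tm package (object names are assumptions):

```r
# Build a corpus from the combined subset
corpus <- VCorpus(VectorSource(subset_blog_news_twitter))

# Apply the clean-up steps listed above
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, stripWhitespace)               # strip whitespace
corpus <- tm_map(corpus, PlainTextDocument)             # plain text format
```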

Tokenize

Breaking a Stream of Text into Words or Short Phrases

The next step was to tokenize the samples and construct matrices of unigrams, bigrams, and trigrams, converting the cleaned dataset into a format usable for Natural Language Processing (NLP).
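One common approach uses RWeka tokenizers with tm term-document matrices; a sketch of this step (function and object names assumed):

```r
# Tokenizer factories for 1-, 2-, and 3-grams
unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Term-document matrices for each n-gram level
tdm_uni <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tok))
tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))
```

Sample frequencies from the three tokenizations are shown below.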

**One word**

| word | frequency |
|---|---|
| ability | 16 |
| able | 59 |
| about | 559 |
| above | 24 |
| absolutely | 24 |
| accept | 13 |

**Two words**

| word | frequency |
|---|---|
| a better | 18 |
| a big | 34 |
| a bit | 42 |
| a car | 15 |
| a chance | 23 |
| a couple | 31 |

**Three words**

| word | frequency |
|---|---|
| a chance to | 15 |
| a couple of | 26 |
| a little bit | 16 |
| a lot of | 60 |
| according to the | 12 |
| all of the | 14 |

Calculate Frequencies of N-Grams

Frequency of Occurrence of Words or Short Phrases

Next, the most frequently occurring words in the data were identified and plotted in charts representing the unigrams, bigrams and trigrams.
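A sketch of how the frequencies and charts might be produced (the term_freq helper is illustrative, not from the original report):

```r
library(ggplot2)

# Collapse a term-document matrix into a sorted frequency table
term_freq <- function(tdm, top_n = 20) {
  freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freqs)[seq_len(top_n)],
             frequency = freqs[seq_len(top_n)],
             row.names = NULL)
}

# Bar chart of the most frequent unigrams; bigrams and trigrams work the same way
uni_freq <- term_freq(tdm_uni)
ggplot(uni_freq, aes(x = reorder(word, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Frequent Unigrams", x = "Word", y = "Frequency")
```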

Wordclouds

Alternative Visualization of the Main Words

As an alternative to the frequency charts, wordclouds give a quick visual impression of the most common words in the corpus.

First is an interactive wordcloud of the trigram tokens (hovering over a phrase shows the number of times it occurs in the token).

Most Frequent Words in Trigram Token


Next are static wordclouds for the other two Tokens - Unigram and Bigram.

Most Frequent Words in Unigram and Bigram Tokens
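A sketch of how these wordclouds might be generated, reusing the illustrative term_freq helper above (the interactive cloud assumes the wordcloud2 package):

```r
library(wordcloud2)
library(RColorBrewer)

# Interactive wordcloud for trigrams (hover shows counts)
tri_freq <- term_freq(tdm_tri, top_n = 100)
wordcloud2(tri_freq)

# Static wordclouds for unigrams and bigrams
uni_freq <- term_freq(tdm_uni, top_n = 100)
bi_freq  <- term_freq(tdm_bi,  top_n = 100)
set.seed(42)  # assumed, for a reproducible layout
wordcloud(words = uni_freq$word, freq = uni_freq$frequency,
          max.words = 100, colors = brewer.pal(8, "Dark2"))
wordcloud(words = bi_freq$word, freq = bi_freq$frequency,
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```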


Overall, the total time taken for the entire processing was:

## [1] "Total Processing Time:  4  minutes"

Next Steps

The next steps will be to:

  • Build the prediction algorithm using the n-gram frequencies explored above
  • Test and refine the model for accuracy and speed

End of Report