Introduction
This is the Milestone Report for the Coursera Data Science Capstone Project. The project involves building a predictive model of English text, a task within Natural Language Processing (NLP) and Text Mining.
The Milestone Report is a deliverable of Week 2 (Exploratory Data Analysis and Modeling). Its primary aim is to demonstrate the ability to work with the data (the three .txt files named ‘blogs’, ‘news’ and ‘twitter’) and to show that the work is on track toward the prediction algorithm.
The analysis in this report is displayed using:
- Dataset Comparison Tables
- Barcharts showing Most Frequently Occurring Words in each n-gram
- Interactive Wordcloud showing Most Frequently Occurring Words in Trigram (with the count of each phrase displayed on mouse hover)
- Static Wordcloud showing Most Frequently Occurring Words for the other two - Unigram and Bigram
Data Source
The training data for this study consists of the following .txt files, stored in a subdirectory of the project. The model will be trained on this collection.
- Blog: en_US.blogs.txt
- News: en_US.news.txt
- Twitter: en_US.twitter.txt
The source data is provided by SwiftKey; the download link is available on the Coursera course site.
Load Libraries and Data
The relevant data were loaded from the respective text files: blogs, news and twitter. All requisite runtime libraries were also loaded.
The blogs data file was loaded first.
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
The news data file was loaded next.
## chr [1:1010242] "He wasn't home alone, apparently." ...
The twitter data file was loaded last.
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
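For reference, a minimal sketch of how such files are typically read into character vectors (the file paths are assumptions based on the file names listed above):

```r
# Read each file line by line; skipNul guards against embedded null characters
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

str(blogs)  # produces output like the chr [1:899288] line shown above
```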
Overview of Datasets
Main Dataset Comparison Statistics
The key information for each of the datasets (blogs, news and twitter) is summarized below, followed by a sketch of how such statistics can be computed:
**The Main Datasets**

| Dataset | Longest line (chars) | Size in memory | File size (MB) | Lines | Non-empty lines | Characters | Non-white characters | Words |
|---------|---------------------:|---------------:|---------------:|------:|----------------:|-----------:|---------------------:|------:|
| blogs   | 40833 | 248.5 Mb | 200.4242 | 899288  | 899288  | 206824382 | 170389539 | 37570839 |
| news    | 11384 | 249.6 Mb | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| twitter | 140   | 301.4 Mb | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
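The statistics above can be reproduced with base R and the stringi package; the sketch below assumes the files are loaded as shown earlier (the table's column names are inferred from this kind of output):

```r
library(stringi)

# Lines, non-empty lines, characters and non-white characters, e.g. for blogs
stri_stats_general(blogs)

sum(stri_count_words(blogs))                 # total word count
max(nchar(blogs))                            # longest line in characters
format(object.size(blogs), units = "Mb")     # size of the object in memory
file.size("en_US.blogs.txt") / 1024^2        # file size on disk in MB
```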
Data Subsets Comparison Statistics
Subsets of the main data files were created to keep processing manageable while allowing direct comparison. The key information is summarized below (see the sketch after the table):
**The Main Datasets and Sub-Datasets**

| Dataset | Size in memory | Lines | Characters | Longest line (chars) |
|---------|---------------:|------:|-----------:|---------------------:|
| blogs | 248.5 Mb | 899288 | 206824505 | 40833 |
| news | 249.6 Mb | 1010242 | 203223159 | 11384 |
| twitter | 301.4 Mb | 2360148 | 162096241 | 140 |
| Blogs_subset | 0.5 Mb | 1798 | 402996 | 2751 |
| News_subset | 0.5 Mb | 2020 | 408182 | 983 |
| twitter_subset | 0.6 Mb | 4720 | 325001 | 140 |
| subset_blog_news_twitter | 1.6 Mb | 8538 | 1148667 | 2209 |
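A sketch of how such subsets might be drawn; random sampling of roughly 0.2% of the lines reproduces the subset line counts in the table, but the exact method and seed are assumptions:

```r
set.seed(1234)  # seed value assumed, for reproducibility

# Sample roughly 0.2% of the lines from each dataset
Blogs_subset   <- sample(blogs,   floor(length(blogs)   * 0.002))
News_subset    <- sample(news,    floor(length(news)    * 0.002))
twitter_subset <- sample(twitter, floor(length(twitter) * 0.002))

# Combine the three subsets into a single vector for the corpus
subset_blog_news_twitter <- c(Blogs_subset, News_subset, twitter_subset)
```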
Corpus Processing
Initial Data Cleanup
A corpus was created from the subsets and the data clean-up activities outlined below were applied (a sketch follows the list):
- Convert all words to lowercase
- Eliminate punctuation
- Eliminate numbers
- Strip whitespace
- Create Plain Text Format
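A minimal sketch of this clean-up pipeline using the tm package; the corpus construction and variable names are assumptions, not the report's actual code:

```r
library(tm)

# Build a corpus from the combined subset (variable name assumed)
corpus <- VCorpus(VectorSource(subset_blog_news_twitter))

# Apply the clean-up steps listed above, in order
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
corpus <- tm_map(corpus, removePunctuation)             # eliminate punctuation
corpus <- tm_map(corpus, removeNumbers)                 # eliminate numbers
corpus <- tm_map(corpus, stripWhitespace)               # strip whitespace
corpus <- tm_map(corpus, PlainTextDocument)             # plain text format
```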
Tokenize
Breaking a Stream of Text into Words or Short Phrases
The next step was to tokenize the samples and construct matrices of Unigrams, Bigrams and Trigrams, converting the clean dataset to a format usable for Natural Language Processing (NLP). A sketch of the tokenization is shown below.
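One common way to build these token matrices is to feed RWeka n-gram tokenizers into tm's TermDocumentMatrix; the sketch below assumes the cleaned corpus from the previous step and is an illustration rather than the exact code used:

```r
library(RWeka)

# Tokenizers for two- and three-word phrases (unigrams use the default tokenizer)
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigram_tdm <- TermDocumentMatrix(corpus)
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
```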
**One word**

| Term | Frequency |
|------|----------:|
| ability | 16 |
| able | 59 |
| about | 559 |
| above | 24 |
| absolutely | 24 |
| accept | 13 |
**Two words**

| Term | Frequency |
|------|----------:|
| a better | 18 |
| a big | 34 |
| a bit | 42 |
| a car | 15 |
| a chance | 23 |
| a couple | 31 |
**Three words**

| Term | Frequency |
|------|----------:|
| a chance to | 15 |
| a couple of | 26 |
| a little bit | 16 |
| a lot of | 60 |
| according to the | 12 |
| all of the | 14 |
Calculate Frequencies of N-Grams
Frequency of Occurrence of Words or Short Phrases
Next, the most frequently occurring words in the data were identified and plotted in charts representing the unigrams, bigrams and trigrams.
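A sketch of how such frequencies can be computed and charted with ggplot2; the get_freq helper and matrix names are assumptions carried over from the tokenization sketch above:

```r
library(ggplot2)

# Assumed helper: collapse a term-document matrix into a sorted frequency table
get_freq <- function(tdm, top_n = 20) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(data.frame(term = names(freq), freq = freq, row.names = NULL), top_n)
}

uni_freq <- get_freq(unigram_tdm)
ggplot(uni_freq, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Most Frequent Unigrams")
```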

Wordclouds
Alternative Visualization of the Main Words
As an alternative to the bar charts, wordclouds give a quick visual impression of the most common words and phrases in the corpus.
First is an interactive wordcloud for the Trigram token (hovering the mouse over a phrase shows the number of times it occurs in the token).
Most Frequent Words in Trigram Token
Next are static wordclouds for the other two Tokens - Unigram and Bigram.
Most Frequent Words in Unigram and Bigram Tokens
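A minimal sketch of both wordcloud styles, assuming the wordcloud2 package for the interactive version and the wordcloud package for the static ones (the frequency tables reuse the assumed get_freq helper from above):

```r
library(wordcloud)
library(wordcloud2)
library(RColorBrewer)

# Interactive trigram wordcloud: wordcloud2 shows the count on mouse hover
tri_freq <- get_freq(trigram_tdm, top_n = 100)
wordcloud2(tri_freq)

# Static wordcloud, e.g. for the unigram token
uni_freq <- get_freq(unigram_tdm, top_n = 100)
wordcloud(words = uni_freq$term, freq = uni_freq$freq,
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```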

Overall, the total time taken for the entire processing was calculated as given below:
## [1] "Total Processing Time: 4 minutes"
Next Steps
The next steps will be to:
- build a predictive model that employs an n-gram model with a frequency lookup similar to the one used in this report (see the sketch after this list).
- combine everything and deploy it as a Shiny app that recommends the likely next word after a phrase is typed.
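As a rough illustration of the frequency-lookup idea, a trigram table can be used to suggest the next word after a two-word prefix; this is purely hypothetical code, not the final model:

```r
# Hypothetical next-word lookup from a trigram frequency table
# (tri_freq assumed to have columns term and freq, as sketched above)
predict_next <- function(prefix, tri_freq) {
  # Keep trigrams whose first two words match the typed prefix
  hits <- tri_freq[startsWith(tri_freq$term, paste0(tolower(prefix), " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  # Return the last word of the most frequent matching trigram
  best <- hits$term[which.max(hits$freq)]
  tail(strsplit(best, " ")[[1]], 1)
}

predict_next("a couple", tri_freq)  # likely "of", given the counts above
```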