This report was generated as part of the week 2 peer review assignment of the Data Science Capstone course by Johns Hopkins University on Coursera. Its purpose is to demonstrate that the author has downloaded the data successfully and has performed a preliminary exploratory analysis. The complete Rmd file can be found here
A summary of the three data sources used for this assignment is presented below, with information on file size and line and word counts. The blog data has the largest file size, the highest word count and the lowest line count, while the Twitter data is the exact opposite, with the lowest word count and the highest line count. This illustrates, to a degree, how users' writing patterns differ across platforms.
| Source | File size (MB) | Lines | Words |
|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,239 |
| News | 196.28 | 1,010,242 | 34,762,395 |
| Twitter | 159.36 | 2,360,148 | 30,093,413 |
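The report does not show the code that produced these counts; a minimal sketch is given below. The file paths assume the standard `en_US.*` layout of the Coursera SwiftKey download and are an assumption, not taken from the report.

```r
# Sketch: summarising file size, line and word counts per source
# (the paths below are assumed, adjust to your local layout)
files <- c(Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt",
           Twitter = "final/en_US/en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    size_mb = round(file.info(path)$size / 1024^2, 2),
    lines   = length(lines),
    words   = sum(lengths(strsplit(lines, "\\s+")))
  )
}

do.call(rbind, lapply(files, summarise_file))
```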
While being fully aware that a word cloud composed of single words most likely won't provide any useful insight into the natural language of interest, the author still decided to feature one because it looks pretty in a report. Sorry for being such a noob, J. Harris. :,)
To speed up the analysis, the above word cloud was generated using only 1% of the original dataset.
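The sampling step itself is not shown in the report; one way to draw a 1% sample is sketched below. The variable names `blogs`, `news` and `twitter` (character vectors read in with `readLines()`) and the seed value are assumptions for illustration.

```r
# Sketch: drawing a roughly 1% sample of each corpus
set.seed(1234)  # arbitrary seed, for reproducibility
sample_lines <- function(lines, p = 0.01) {
  # keep each line independently with probability p
  lines[rbinom(length(lines), size = 1, prob = p) == 1]
}
sampled <- c(sample_lines(blogs),
             sample_lines(news),
             sample_lines(twitter))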
The same 1% sample of the combined dataset was also used for tokenization, and the 15 most frequent unigrams, bigrams and trigrams are shown below.
The package used for tokenization makes a huge difference in how much time is required to generate the word-grams. The tm + RWeka approach used by many is the slowest for me, while tidytext takes only around one minute of user time to produce the final results. This report was generated with unnest_tokens() from the tidytext package. The stop words and the NAs generated during cleaning have been removed. However, I do think that when building a predictive model, stop words should still be included: they are the most commonly observed tokens in English and should therefore always be suggested to users as part of the auto-complete function.
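A sketch of how the n-gram tables could be produced with tidytext follows; `sampled` is the character vector from the sampling sketch above, and for brevity stop words are removed only from the unigrams here.

```r
# Sketch: tidytext tokenization of the sampled text into n-grams
library(dplyr)
library(tidytext)

text_df <- tibble(line = seq_along(sampled), text = sampled)

# Unigrams, with stop words removed for the frequency plots
unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 15)

# Bigrams and trigrams; unnest_tokens() yields NA for lines too
# short to form an n-gram, so those rows are filtered out
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  slice_head(n = 15)

trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  slice_head(n = 15)
```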