The goal of this project is to show that I have become familiar with the data and that I am on track to create the prediction algorithm.

This document is concise: it explains only the major features of the data I have identified and briefly summarizes my plans for creating the prediction algorithm and Shiny app, in a way that a manager who is not a data scientist can understand.

The motivation for this project is to create a basic report of summary statistics about the data sets and to report any interesting findings I have amassed so far.

## [1] "Size of the News dataset is  196 MB"
## [1] "Size of the Blogs dataset is  200 MB"
## [1] "Size of the Twitter dataset is  159 MB"
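File sizes like these can be read with base R's `file.size()`, which returns bytes. The following is a minimal sketch, not the code that produced the output above: the helper name `file_size_mb` is hypothetical, and a small temporary file stands in for the real corpus files, whose paths are not shown in this report.

```r
# Hypothetical helper: on-disk size of a file, rounded to whole megabytes.
# file.size() returns bytes, so divide by 1024^2.
file_size_mb <- function(path) {
  round(file.size(path) / 1024^2)
}

# Demo on a small temporary file (an assumption; the real data paths are not shown).
tmp <- tempfile(fileext = ".txt")
writeLines(rep("a sample line of text", 1000), tmp)

size_bytes <- file.size(tmp)
print(paste("Size of the demo file is", size_bytes, "bytes"))
print(paste("Size of the demo file is", file_size_mb(tmp), "MB"))
```

For the real data, the same helper would be called once per corpus file and the result pasted into a message of the form shown above.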

Basic Summaries

## [1] "Memory usage for News dataset is  261759048 bytes"
## [1] "Memory usage for Blogs dataset is  260564320 bytes"
## [1] "Memory usage for Twitter dataset is  316037600 bytes"
## [1] "Number of lines in the News dataset is  1010242"
## [1] "Number of lines in the Blogs dataset is  899288"
## [1] "Number of lines in the Twitter dataset is  2360148"
## [1] "The longest line in the News dataset is  11384"
## [1] "The longest line in the Blogs dataset is  40833"
## [1] "The longest line in the Twitter dataset is  140"
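Summaries of this kind need only base R: `object.size()` for in-memory size (in bytes), `length()` for the number of lines, and `nchar()` for line lengths. The sketch below is illustrative rather than the report's actual code. The vector `corpus_lines` is a toy stand-in for a corpus read with `readLines()`, and all variable names are assumptions.

```r
# Toy stand-in for a corpus loaded with readLines(); the real data are not reproduced here.
corpus_lines <- c("short line",
                  "a somewhat longer line of text",
                  "mid-sized line")

mem_bytes <- as.numeric(object.size(corpus_lines))  # in-memory size, in bytes
n_lines   <- length(corpus_lines)                   # number of lines
longest   <- max(nchar(corpus_lines))               # characters in the longest line

print(paste("Memory usage is", mem_bytes, "bytes"))
print(paste("Memory usage is", format(object.size(corpus_lines), units = "Mb")))
print(paste("Number of lines is", n_lines))
print(paste("The longest line is", longest))
```

Because `object.size()` reports bytes, either label the result as bytes or convert it with `format(..., units = "Mb")` before printing it as megabytes.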

N-Grams Histogram