The goal of this project is to show that I have become familiar with the data and that I am on track to create the prediction algorithm.
This document is concise and explains only the major features of the data I have identified. It also briefly summarizes my plans for creating the prediction algorithm and Shiny app in a way that a non-data-scientist manager can understand.
The motivation for this project is to create a basic report of summary statistics about the data sets and to report any interesting findings I have gathered so far.
## [1] "Size of the News dataset is 196 MB"
## [1] "Size of the Blogs dataset is 200 MB"
## [1] "Size of the Twitter dataset is 159 MB"
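File sizes like the ones above can be read from disk without loading the data. The following is only a minimal sketch in base R, not the code used for this report: the helper name `report_file_size` and the temporary demo file are hypothetical, and the conversion assumes 1 MB = 1024^2 bytes.

```r
# Hypothetical helper: report a file's size on disk in megabytes.
# Assumes 1 MB = 1024^2 bytes; the convention used in the report is not stated.
report_file_size <- function(path, label) {
  size_mb <- file.size(path) / 1024^2
  sprintf("Size of the %s dataset is %.0f MB", label, size_mb)
}

# Demonstration on a small temporary file, not the real corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(rep("the quick brown fox", 1000), tmp)
size_bytes <- file.size(tmp)
print(report_file_size(tmp, "Demo"))
```

For the real data, `path` would point at the downloaded News, Blogs, or Twitter text file.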
Basic Summaries
## [1] "Memory usage for News dataset is 261759048 bytes"
## [1] "Memory usage for Blogs dataset is 260564320 bytes"
## [1] "Memory usage for Twitter dataset is 316037600 bytes"
## [1] "Number of lines in the News dataset is 1010242"
## [1] "Number of lines in the Blogs dataset is 899288"
## [1] "Number of lines in the Twitter dataset is 2360148"
## [1] "The longest line in the News dataset is 11384 characters"
## [1] "The longest line in the Blogs dataset is 40833 characters"
## [1] "The longest line in the Twitter dataset is 140 characters"
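The kinds of figures reported in this section (in-memory size, number of lines, longest line) can be computed from a character vector of lines. The sketch below is illustrative only: `summarise_lines` is a hypothetical helper, and the toy vector stands in for the output of `readLines()` on a real file.

```r
# Hypothetical helper: summarise a character vector of text lines.
summarise_lines <- function(lines) {
  list(
    memory_bytes = as.numeric(object.size(lines)),  # object.size() reports bytes
    line_count   = length(lines),                   # one element per line
    longest_line = max(nchar(lines))                # length in characters
  )
}

# Toy input standing in for readLines() output on a real file.
demo <- c("short", "a somewhat longer line", "mid length")
stats <- summarise_lines(demo)

# format() converts the byte count to a human-readable unit.
print(format(object.size(demo), units = "MB"))
str(stats)
```

Note that `object.size()` returns bytes, so a value needs to be divided by 1024^2 (or passed through `format(..., units = "MB")`) before it is labeled as MB.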
N-Grams Histogram