Introduction

This report summarizes the initial exploratory data analysis for the Capstone project. The dataset consists of three English-language (en_US) text files, drawn from blogs, news articles, and Twitter.

Loading and Summary Statistics

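library(stringi)  # provides stri_count_words()

# Read each file as UTF-8; skipNul = TRUE skips embedded NUL bytes
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Summary: number of lines, then total word count, for each source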
length(blogs)
## [1] 899288
length(news)
## [1] 1010206
length(twitter)
## [1] 2360148
sum(stri_count_words(blogs))
## [1] 37546806
sum(stri_count_words(news))
## [1] 34761151
sum(stri_count_words(twitter))
## [1] 30096690
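
The counts above are easier to compare side by side. The following is a minimal sketch, using only base R and the figures reported above, that collects them into one table and derives the average number of words per line. The names `corpus_summary` and `words_per_line` are illustrative and not part of the original analysis.

```r
# Figures copied from the output above; object and column names are illustrative.
corpus_summary <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines  = c(899288, 1010206, 2360148),
  words  = c(37546806, 34761151, 30096690),
  stringsAsFactors = FALSE
)

# Average number of words per line, rounded to two decimals
corpus_summary$words_per_line <- round(corpus_summary$words / corpus_summary$lines, 2)

print(corpus_summary)
```

On these figures, blog and news entries average roughly 35 to 42 words per line, while tweets average about 13. This is consistent with the Twitter file having the most lines but the fewest words.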