The goal of this report is to explore the text data provided for the Coursera Data Science Capstone project. Understanding the structure and size of the data is a first step toward building a next-word prediction model.
The data consists of three text sources:

- Blogs
- News
- Twitter
These datasets contain large amounts of unstructured English text.
The following statistics summarize the size of each dataset.
# Number of lines in each source file
blogs_lines <- 899288
news_lines <- 77259
twitter_lines <- 2360148

# Summarize the line counts in a single table
data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(blogs_lines, news_lines, twitter_lines)
)
##    Source   Lines
## 1   Blogs  899288
## 2    News   77259
## 3 Twitter 2360148
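
The line counts above are entered as literal values. As a minimal sketch of how such counts might be obtained directly from the files, the helper below reads a file in chunks and adds up the number of lines. The function name `count_lines`, the `chunk_size` parameter, and the file names in the commented usage are assumptions made for illustration; they do not come from this report.

```r
# Hypothetical helper: count the lines of a text file chunk by chunk,
# so that the whole file never has to be held in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  # Binary mode avoids platform-specific text-mode handling of control characters
  con <- file(path, open = "rb")
  on.exit(close(con))
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  total
}

# Small self-contained demonstration on a temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo_count <- count_lines(demo_path)
unlink(demo_path)

# Hypothetical usage on the project data (file names are assumptions):
# blogs_lines   <- count_lines("en_US.blogs.txt")
# news_lines    <- count_lines("en_US.news.txt")
# twitter_lines <- count_lines("en_US.twitter.txt")
```

Computing the counts this way keeps the table reproducible if the input files change, at the cost of one full pass over each file.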