In this document we will quickly explore the three English datasets provided for the SwiftKey capstone project: en_US.twitter.txt, en_US.news.txt, and en_US.blogs.txt. As the names suggest, these datasets contain tweets, blog posts, and articles from various news sources. The raw capstone dataset can be found here.

In this report, we will explore basic features of each dataset, including:

* the total number of articles per dataset
* the distribution of word counts
* the distribution of article lengths
* the top 10 most common words in each dataset

Data Overview

The raw English dataset is collected from three sources: tweets, blogs, and news articles. A brief breakdown of each of these datasets can be seen below:

##            filename num_lines num_tokens
## 1 en_US.twitter.txt   2360148   15421354
## 2   en_US.blogs.txt    899288   17585245
## 3    en_US.news.txt   1010242   18252511
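
A summary like the one above could be computed with a short R sketch along these lines; the `data/` directory, encoding, and helper name are assumptions for illustration, not taken from the original report:

```r
# Minimal sketch: count lines and whitespace-delimited tokens per file.
# Assumes the three files live in a local "data/" directory (hypothetical path).
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")

summarize_file <- function(f) {
  lines <- readLines(file.path("data", f), encoding = "UTF-8", skipNul = TRUE)
  # strsplit on runs of whitespace; lengths() gives the token count per line
  data.frame(filename   = f,
             num_lines  = length(lines),
             num_tokens = sum(lengths(strsplit(lines, "\\s+"))))
}

do.call(rbind, lapply(files, summarize_file))
```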

We can see that each of these datasets includes over 15 million tokens. What is a token, you may ask? We can define a token as any sequence of characters delimited by whitespace. For example, the line “I like 3 types of mustard” breaks down into the tokens [“I”, “like”, “3”, “types”, “of”, “mustard”].
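
This whitespace splitting is easy to reproduce in R, for example:

```r
strsplit("I like 3 types of mustard", "\\s+")[[1]]
## [1] "I"       "like"    "3"       "types"   "of"      "mustard"
```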

Furthermore, we can see that blogs and news articles tend to have a similar rate of ‘tokens-per-article’, while tweets are significantly shorter. This makes sense considering tweets are constrained by a character limit.

##            filename tokens.per.article
## 1 en_US.twitter.txt              6.534
## 2   en_US.blogs.txt             19.555
## 3    en_US.news.txt             18.067
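
This rate is simply the token count divided by the line count; using the figures from the first table:

```r
# Tokens-per-article derived from the counts in the first table.
counts <- data.frame(
  filename   = c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
  num_lines  = c(2360148, 899288, 1010242),
  num_tokens = c(15421354, 17585245, 18252511)
)
counts$tokens.per.article <- round(counts$num_tokens / counts$num_lines, 3)
counts[, c("filename", "tokens.per.article")]
```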

We can better visualize the distribution of tokens per medium via histograms. For visualization purposes, we take the logarithm of the total tokens per article; without this transformation the distribution would be heavily skewed to the right, with a long tail of very long articles. For illustrative purposes, have a look at the histogram of tokens in news articles:
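
One way such a plot could be generated is sketched below, under the same assumed `data/` path as before; empty lines are dropped to avoid taking log10(0):

```r
# Sketch of the log-scaled histogram for the news dataset (path assumed).
news_lines <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tokens_per_article <- lengths(strsplit(news_lines, "\\s+"))
tokens_per_article <- tokens_per_article[tokens_per_article > 0]  # avoid log10(0)
hist(log10(tokens_per_article),
     main = "Tokens per news article",
     xlab = "log10(tokens per article)")
```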