This report describes our initial exploratory analysis of the training data set we will use to build an app that predicts the next word of a sentence. In this initial exploration, we download the data files containing Twitter posts, blog posts, and news articles. Next, we examine each file by size, word count, and line count. Finally, we display some basic summary statistics for each file to get a general sense of the differences between them, and we display a word cloud.

Recognizing that this will be a computationally intensive project for R, we use the RevoUtilsMath package (Microsoft R Services Math Utilities, version 3.2.3). Not only does this allow us to engage all cores of an Intel processor, but, of perhaps greater benefit, it lets R link against the Intel® Math Kernel Library (Intel® MKL). This Revolution Analytics blog post (Revolution Analytics' site is now part of the Microsoft R Application Network) describes how the library accelerates math processing routines, with an emphasis on faster matrix algebra. As the Term Document Matrix is at the heart of quantitative corpus linguistics, this library will reduce our testing and development time, allowing us to explore algorithms that would otherwise be impractical to run in a reasonable amount of time.

Data

First, we link to the Intel Math Kernel Library and engage all available cores of the processor. Next, we download the data and load it into R. The following scripts do this.

library(RevoUtilsMath)
# Engage all available physical cores for MKL-accelerated math routines
cores <- parallel::detectCores(logical = FALSE)
setMKLthreads(cores)
# Define file and destination paths
file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- path.expand("~/Misc/R/Capstone/Swiftkey.zip")
# Download the zip in binary mode and extract it into the working directory
download.file(file, dest, mode = "wb")
unzip(dest)
# Load files into R
library(RTextTools)
blog_file <- paste(getwd(),"final", "en_US", "en_US.blogs.txt", sep="/")
news_file <- paste(getwd(),"final", "en_US","en_US.news.txt", sep="/")
twitter_file <- paste(getwd(),"final", "en_US","en_US.twitter.txt", sep="/")

# skipNul avoids warnings from embedded nul characters (notably in the news file)
blogs   <- readLines(blog_file, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_file, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)

Summary Statistics

Second, I provide some basic summary statistics. Line counts, word counts, and file sizes for the Twitter, blogs, and news files are displayed in the table below.

File Name   Size (MB)   Number of Lines   Word Count
News           196.28            77,259    2,665,742
Blogs          200.42           899,288   37,865,888
Twitter        159.36         2,360,148   30,578,891
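
As a reference, the figures above can be reproduced with base R. The sketch below assumes the blogs, news, and twitter vectors and the file paths defined in the loading step; summarize_file is a hypothetical helper, and word counts are simple whitespace splits.

# Size in MB, line count, and whitespace-delimited word count for one file
summarize_file <- function(path, lines) {
  data.frame(
    Size_MB         = round(file.size(path) / 1024^2, 2),
    Number_of_Lines = length(lines),
    Word_Count      = sum(lengths(strsplit(lines, "\\s+")))
  )
}

file_stats <- rbind(
  News    = summarize_file(news_file, news),
  Blogs   = summarize_file(blog_file, blogs),
  Twitter = summarize_file(twitter_file, twitter)
)
file_stats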

What about the composition of Twitter posts, blogs, and news articles? Below are summary statistics on the number of characters per line. Blogs have the widest distribution of characters, while Twitter has the narrowest. Bloggers sometimes rant and rave, while Twitter users are constrained to 140-character tweets, so both figures make sense.

          Minimum   1st Qu.   Median     Mean   3rd Qu.    Maximum
Blogs        1.00     47.00   157.00   231.70    331.00   40840.00
News         2.00    111.00   186.00   203.00    270.00    5760.00
Twitter      2.00     37.00    64.00    68.80    100.00     213.00
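
For completeness, the per-line character summaries come directly from nchar(); this sketch again assumes the three character vectors loaded above.

# Per-line character counts, summarized for each source
char_summary <- rbind(
  Blogs   = summary(nchar(blogs)),
  News    = summary(nchar(news)),
  Twitter = summary(nchar(twitter))
)
round(char_summary, 2)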

Create a Document Term Matrix and Count Words

Next, I create a corpus, removing all white space, numbers, punctuation, and common English stop words such as “and”, “for”, “in”, “is”, “it”, “not”, “the”, and “to”. I also transform all words to lower case so that capitalized words don't produce duplicate counts for the same word: without this step, the program would count “Models” as a word distinct from “models”. These pre-processing steps are important for creating a robust Document Term Matrix.
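
One way to carry out these pre-processing steps is with the tm package; the sketch below applies them to the news sample (the transformations are standard tm functions, but this is an illustration rather than the exact code behind the figures).

library(tm)

# Build a corpus from the news sample and apply the pre-processing steps above
news_corpus <- Corpus(VectorSource(news))
news_corpus <- tm_map(news_corpus, content_transformer(tolower))       # fold case
news_corpus <- tm_map(news_corpus, removeNumbers)                      # drop digits
news_corpus <- tm_map(news_corpus, removePunctuation)                  # drop punctuation
news_corpus <- tm_map(news_corpus, removeWords, stopwords("english"))  # drop common English words
news_corpus <- tm_map(news_corpus, stripWhitespace)                    # collapse extra white space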

At the heart of most text mining operations is a term matrix. R's ability to handle very large matrix computations is one of the things that sets the language apart from the user-friendly, but far more limited, Microsoft Excel. Once the matrix is constructed with terms as rows (a Term Document Matrix), one can count word frequencies using the rowSums function (use colSums if documents are the rows). Then, one can use the sort function to rank each word from highest number of appearances to lowest.
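
Continuing the tm-based sketch, a Term Document Matrix built from the corpus above yields word frequencies as row sums:

# Term Document Matrix: one row per term, one column per document
tdm <- TermDocumentMatrix(news_corpus)

# Row sums give the total appearances of each term; sort from most to least frequent.
# (as.matrix is fine for a sample; use slam::row_sums on a very large corpus.)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Top 10 words with their share of all counted tokens
top10 <- data.frame(word      = names(word_freq)[1:10],
                    frequency = word_freq[1:10],
                    Percent   = round(100 * word_freq[1:10] / sum(word_freq), 2))
top10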

Let's view the top 10 words to get some flavor of the data.

Word     Frequency   Percent
said         19164      1.26
will          8463      0.56
one           6387      0.42
new           5326      0.35
also          4515      0.30
two           4430      0.29
can           4393      0.29
year          4218      0.28
first         4147      0.27
just          4126      0.27

Data Illustration: The Word Cloud

The word cloud is an excellent visual tool for displaying quantitative features of a corpus in ways that traditional plots do not. We define a color palette that colors each word according to how many times it appears in our sample corpus. In addition, the size of each word varies with its frequency of appearance. I set the minimum frequency of the word cloud to 1000, meaning that each word displayed below appears in the corpus at least 1,000 times. The news file of the corpus was used as the sample, and as one can see, “news speak” is on full display, with the word “said” dominating the conversation.
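
The cloud itself can be drawn with the wordcloud package, using the word_freq vector from the Term Document Matrix sketch above; the Brewer palette chosen below is illustrative, and min.freq = 1000 enforces the threshold just described.

library(wordcloud)
library(RColorBrewer)

# Colour and size each word by frequency; draw only words appearing at least 1000 times
pal <- brewer.pal(8, "Dark2")
wordcloud(words = names(word_freq), freq = word_freq,
          min.freq = 1000, random.order = FALSE, colors = pal)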

Next, I will explore the suite of predictive analytics models available through the RTextTools package. These models are both powerful and easy to implement. Depending on the results, I will also explore writing custom functions to better interface the Document Term Matrix with the many statistical learning algorithms available in the R statistical computing environment. In summary, I have not only explored the data but also built a strong computational foundation for optimizing the predictive algorithms we will rely on to complete the project.