This report describes our initial exploratory analysis of the training data set we will use to build an app that predicts the next word of a sentence. In this initial exploration, we download the data files containing Twitter posts, blog posts, and news articles. Next, we examine each file by size, word count, and line count. Finally, we display basic summary statistics for each file to get a general sense of the differences between them, and we display a word cloud.
Recognizing that this will be a computationally intensive project for R, we use the RevoUtilsMath (Microsoft R Services Math Utilities) package, version 3.2.3. Not only does this upgrade allow one to engage all cores of an Intel processor, but perhaps of greater benefit is the ability it provides to link with the Intel® Math Kernel Library (Intel® MKL). A Revolution Analytics blog post (the Revolution Analytics site is now known as the Microsoft R Application Network) describes how the library accelerates math processing routines, with an emphasis on faster calculations involving matrix algebra. As the Term Document Matrix is at the heart of quantitative corpus linguistics, this library will reduce our testing and development time, allowing for the exploration of algorithms that would otherwise be impractical to implement in a reasonable amount of time.
First, we link to the Intel Math Kernel Library and engage all available cores of the processor. Next, we download the data and load it into R. The following scripts do this.
# Link to the Intel MKL and engage all available cores of the processor
library(RevoUtilsMath)
cores <- parallel::detectCores()
setMKLthreads(cores)
# Define file and destination paths
file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- path.expand("~/Misc/R/Capstone/Swiftkey.zip")
# Download the archive (binary mode) if it is not already present, then unzip it
if (!file.exists(dest)) download.file(file, dest, mode = "wb")
unzip(dest)
# Load files into R
library(RTextTools)
blog_file    <- paste(getwd(), "final", "en_US", "en_US.blogs.txt",   sep = "/")
news_file    <- paste(getwd(), "final", "en_US", "en_US.news.txt",    sep = "/")
twitter_file <- paste(getwd(), "final", "en_US", "en_US.twitter.txt", sep = "/")
# skipNul = TRUE guards against embedded nulls in the raw text files
blogs   <- readLines(blog_file,    encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_file,    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)
Second, I provide some basic summary statistics. Line counts, word counts, and file sizes for the Twitter, blogs, and news files are displayed in the table below; a sketch of how these statistics can be computed follows the table.
| File Name | Size (MB) | Number of Lines | Word Count |
|---|---|---|---|
| News | 196.28 | 77259 | 2665742 |
| Blogs | 200.42 | 899288 | 37865888 |
| Twitter | 159.36 | 2360148 | 30578891 |
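The exact code that produced the table is not reproduced here, but the statistics can be computed with base R along the following lines; note that counting whitespace-delimited tokens may differ slightly from the tokenization behind the figures above.

# Approximate summary statistics for one file (repeat for blogs and twitter)
size_mb    <- file.info(news_file)$size / 1024^2      # size on disk in MB
line_count <- length(news)                            # one vector element per line
word_count <- sum(lengths(strsplit(news, "\\s+")))    # whitespace-delimited tokens
data.frame(File_Name = "News", Size_MB = round(size_mb, 2),
           Number_of_Lines = line_count, Word_Count = word_count)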
What about the composition of Twitter posts, blogs, and news articles? Below are summary statistics on the number of characters per line; a sketch of the computation follows the table. We can see that blogs have the widest distribution of character counts, while Twitter has the narrowest. Bloggers sometimes rant and rave, while Twitter users are constrained to a life of 140-character tweets, so both figures make sense.
| File | Minimum | 1st Qu. | Median | Mean | 3rd Qu. | Maximum |
|---|---|---|---|---|---|---|
| Blogs | 1.00 | 47.00 | 157.00 | 231.70 | 331.00 | 40840.00 |
| News | 2.00 | 111.00 | 186.00 | 203.00 | 270.00 | 5760.00 |
| Twitter | 2.00 | 37.00 | 64.00 | 68.80 | 100.00 | 213.00 |
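These per-line character distributions can be reproduced with base R's nchar() and summary(); a minimal sketch:

# Characters per line for each file; summary() returns min, quartiles, mean, and max
rbind(Blogs   = summary(nchar(blogs)),
      News    = summary(nchar(news)),
      Twitter = summary(nchar(twitter)))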
Next, I create a Corpus, removing all white space, numbers, punctuation, and common English words such as "and", "for", "in", "is", "it", "not", "the", and "to". I also transform all words to lower case so that capitalized words don't result in duplicate counts for the same word. For example, if we don't transform all words to lower case, then our program will count "Models" as a word distinct from "models". These pre-processing steps are important for creating a robust Term Document Matrix.
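The report does not name the text-mining package used, but assuming the widely used tm package, the pre-processing steps described above look roughly like the following (applied here to a sample of the news lines; the sample size is an arbitrary choice for illustration).

library(tm)
set.seed(1234)
# Sample the news lines to keep memory use manageable, then build a corpus
news_sample <- sample(news, 10000)
corpus <- VCorpus(VectorSource(news_sample))
# Pre-processing: lower case, then strip numbers, punctuation, stop words, and extra white space
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)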
At the heart of most text mining operations is the Term Document Matrix, in which each row corresponds to a term and each column to a document. R's ability to handle large matrix computations is one of the things that really sets the language apart from the user-friendly, but extremely limited, Microsoft Excel. Once one constructs a Term Document Matrix, one can count word frequencies by summing across each row with the rowSums function. Then, one can use the sort function to rank the words from the highest number of appearances to the lowest.
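Continuing from the cleaned corpus sketched above (again assuming the tm package), the frequency ranking works roughly as follows; the percent column divides each count by the total number of term occurrences.

# Term Document Matrix: rows are terms, columns are documents
tdm  <- TermDocumentMatrix(corpus)
# Total count per term, sorted from most to least frequent
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
top10 <- data.frame(Word      = names(freq)[1:10],
                    Frequency = freq[1:10],
                    Percent   = round(100 * freq[1:10] / sum(freq), 2))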
Let's view the top 10 words to get a flavor of the data.
| Word | Frequency | Percent |
|---|---|---|
| said | 19164 | 1.26 |
| will | 8463 | 0.56 |
| one | 6387 | 0.42 |
| new | 5326 | 0.35 |
| also | 4515 | 0.30 |
| two | 4430 | 0.29 |
| can | 4393 | 0.29 |
| year | 4218 | 0.28 |
| first | 4147 | 0.27 |
| just | 4126 | 0.27 |
The word cloud is an excellent visual tool for displaying quantitative features of a corpus in ways that traditional plots do not. We define a color palette that colors each word based on how many times it appears in our sample corpus. In addition, the size of each word varies with its frequency of appearance. I set the minimum frequency of the word cloud to 1000, meaning that each word displayed below appears in the corpus at least 1000 times. The news file of the corpus was used as a sample, and as one can see, "news speak" is on full display, with the word "said" dominating the conversation.
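The plotting code is not shown in the report; assuming the wordcloud and RColorBrewer packages, a word cloud with a frequency-based palette and a minimum frequency of 1000 can be produced roughly as follows, reusing the term frequencies computed above.

library(wordcloud)
library(RColorBrewer)
# Color and size each word by its frequency; only terms appearing at least 1000 times are drawn
pal <- brewer.pal(8, "Dark2")
wordcloud(words = names(freq), freq = freq,
          min.freq = 1000, scale = c(4, 0.5),
          colors = pal, random.order = FALSE)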
Next, I will explore the suite of predictive analytics models available through the RTextTools package. The models are both powerful and easy to implement. Depending on the results, I will also explore writing custom functions to better interface the Term Document Matrix with the many statistical learning algorithms available through the R statistical computing environment. In summary, I have not only explored the data, but also built a strong computational foundation for optimizing the predictive algorithms we will rely on to successfully complete the project.